In This Issue


This issue of Survey Methodology contains the second in an annual invited paper series in honour 
of Joseph Waksberg. A brief description of the series and a short biography of Joseph Waksberg 
were given in the June 2001 issue of the journal. The author of the Waksberg Invited Paper for 2002 
is Wayne Fuller. I would like to thank the members of the Committee, Graham Kalton (chair), Chris 
Skinner, David Binder and Paul Biemer, for having chosen such a distinguished statistician, who 
has made profound contributions to many areas of statistical theory and practice, as the author of the 
second paper in the Waksberg Invited Paper Series. 

In his paper entitled “Regression Estimation for Survey Samples” Wayne Fuller presents a broad 
overview of historical and recent developments in the use of regression models in surveys for 
estimation, weight calibration and non-response adjustment. After a brief introduction and historical 
background, he discusses the use of regression models for estimation in complex surveys from a 
design based perspective. He follows this with an exploration of the model based perspective. Other 
topics discussed are the use of regression models for multinomial data, techniques available when 
auxiliary variables are available for every unit of the population, and regression to account for the 
effects of non-response in surveys. Finally, consideration of a few practical aspects of applications 
rounds out this insightful overview of an important area of inference from survey data to which 
Wayne Fuller himself has made many important contributions. 

This issue also contains a special section “Remembering Leslie Kish” which includes four papers, 
one by Leslie Kish himself containing some of his last thoughts on the topics of combining samples 
and surveys. Two of the other papers discuss implementations of Leslie Kish’s idea of rolling 
censuses. These two papers were also presented at the Statistics Canada Symposium 2001 in a 
special session entitled “Remembering Leslie Kish”. 

The first paper in the special section, by Graham Kalton, presents an inspiring overview of Kish’s 
contributions to many areas of statistics. Many of the problems that Kish worked on are put into 
historical perspective and their practical importance is emphasized. 

The paper by Kish presents ideas that he was still working on at the time of his death in October 
2000. I am grateful to Graham Kalton and Jack Gambino for making editorial corrections to the 
paper, but it is presented largely as it was at the time of Kish’s death. In this paper he argues that, 
just as statistics represented a new paradigm in the scientific method, and survey sampling required 
a new paradigm in statistics, so rolling samples and multi-population surveys require new paradigms 
in survey methods. We can only speculate as to what the final paper would have been like had Kish 
lived. 

Alexander describes the American Community Survey, planned to be introduced by the U.S. 
Census Bureau in coming years as a replacement for the decennial census long form. This is a very 
large survey based very much on the idea of rolling samples and censuses that Kish introduced more 
than twenty years ago. This paper discusses the concepts, frame, sampling design, and cumulation 
of samples and weighting. 

The final paper in the special section, by Durr and Dumais, describes the new rolling census being 
introduced in France to replace their more traditional census. In this rolling census, every small 
commune will be surveyed once within a five year period; larger communes will be divided into five 
rotation groups, each rotation group being surveyed in one of the five years. This paper describes 
objectives, design and estimation procedures for the rolling census. 

In their article, Cahill and Chen develop an approach to exploit data from multiple surveys and 
epochs by benchmarking the parameter estimates of logit models of binary choice and semi- 
parametric survival models. Estimates obtained from a survey rich in explanatory variables are 
benchmarked to information from a survey with significant historical depth. Cahill and Chen 
demonstrate how the method can be applied, using the maternity leave module of the LifePaths 
dynamic microsimulation project at Statistics Canada. 


Garren and Chang consider the problem of the non-telephone population in telephone surveys 
using random digit dialing. Using Public Use Microdata Samples, the propensity that a household 
owns a phone is estimated using generalized linear regression and is used during estimation. 
Asymptotic biases and variances are presented for both the non-poststratified and poststratified 
estimators incorporating and not incorporating the estimated propensity. These four estimators are 
further compared through a simulation study. 

The article by Tillé develops an estimator that can be used to avoid the problem of empty 
post-strata that can occur with the usual post-stratified estimator. The idea involves using a 
conditionally weighted estimator and conditioning on ranks in the population of an auxiliary variable 
known for all units of this population. In this way, the sizes of the post-strata are set in the sample 
and random in the population. The next step is to calculate the mean of the conditionally weighted 
estimators to obtain greater stability. The estimator obtained is calibrated on distribution, linear and 
exactly unbiased. A simulation study is used to show that the proposed estimator is more robust than 
the generalized regression estimator when the relation of the variable of interest and the auxiliary 
variable is not linear. Lastly, the article proposes an approximate estimator of the variance verified 
using simulations. 

Shao and Butani consider the problem of estimating variances for imputed survey estimators. 
They show that the resulting variances can be estimated in two parts, the first of which can be 
estimated using a grouped half-sample method that incorporates adjustments to take imputation into 
account. As the estimation of the second part may entail many derivations, Shao and Butani propose 
an adjustment to the grouped half-sample method that leads to approximately unbiased variance 
estimates. 

In his paper Cohen describes a method to implement Rao and Shao’s jackknife method of 
estimating variances to account for imputation using replicate weights. Rao and Shao’s method 
involves calculating, for each jackknife replicate, adjusted values of imputed data points. The 
method can be used with either mean imputation or hot deck imputation. Cohen’s method involves 
adding extra rows to the replicate weight file. For each imputed value, one extra row is added for 
each respondent in the same imputation class. 

In the last paper of this issue, Valliant studies several variance estimators for the General 
Regression (GREG) estimator. The interest is in finding variance estimators that, under certain 
conditions, are approximately unbiased for both the design-variance and the model-variance even 
if the model that motivates the GREG has an incorrect variance parameter. A key feature of these 
robust estimators is the adjustment of squared residuals by factors analogous to the leverages used 
in standard regression analysis. It is shown that the delete-one jackknife implicitly includes the 
leverage adjustments and is a good choice from either the design-based or model-based perspective. 
A simulation study shows that these variance estimators have small bias and produce confidence 
intervals with near-nominal coverage rates. 
Waksberg Invited Paper Series


Survey Methodology has established an annual invited paper series in honor of Joseph Waksberg, who has 
made many important contributions to survey methodology. Each year, a prominent survey researcher will 
be chosen to author a paper that will review the development and current state of a significant topic in the 
field of survey methodology. The author receives a cash award, made possible through a grant from Westat 
in recognition of Joe Waksberg’s contributions during his many years of association with Westat. The 
grant is administered financially and managed by the American Statistical Association. The author of the 
paper is selected by a four-person committee appointed by Survey Methodology and the American 
Statistical Association. 


JOSEPH WAKSBERG


2002 WAKSBERG INVITED PAPER 
Author: Wayne A. Fuller


Wayne A. Fuller is Emeritus Distinguished Professor in Statistics and Economics at Iowa State University. 
He has published approximately 100 articles in more than twenty journals and is author of the texts 
Introduction to Statistical Time Series and Measurement Error Models. As a member of the Survey Group 
at Iowa State University, he had primary responsibility for developing estimation procedures for a large 
longitudinal national survey called the U.S. National Resources Inventory. His research interests in survey 
sampling include regression estimation, small area estimation, imputation, and multiple phase sampling. 
He currently chairs the Advisory Committee on Statistical Methods of Statistics Canada. 
MEMBERS OF THE WAKSBERG PAPER SELECTION COMMITTEE (2002-2003)


David A. Binder (Chair), Statistics Canada 

J. Michael Brick, Westat, Inc. 

David R. Bellhouse, University of Western Ontario
Paul Biemer, Research Triangle Institute, U.S.A.


Past Chairs: 
Graham Kalton (1999 - 2001) 
Chris Skinner (2001 - 2002) 


Past Authors: 


Gad Nathan (2001) 


Nominations: 


Nominations of individuals to be considered as authors or suggestions for topics should 
be sent to the chair of the committee, D.A. Binder, at Statistics Canada, 3rd floor, R.H.
Coats Bldg., Tunney’s Pasture, Ottawa, Ontario, Canada, K1A 0T6, by e-mail at
binderdav@statcan.ca or by fax (613) 951-5711. Nominations and suggestions for
topics must be received by December 6, 2002. 
Regression Estimation for Survey Samples


WAYNE A. FULLER


ABSTRACT 


Regression and regression related procedures have become common in survey estimation. We review the basic properties 
of regression estimators, discuss implementation of regression estimation, and investigate variance estimation for regression 
estimators. The role of models in constructing regression estimators and the use of regression in nonresponse adjustment
are explored.


KEY WORDS: Auxiliary information; Calibration; Least squares; Design consistency; Linear prediction. 


1. INTRODUCTION 


Design and estimation in survey sampling involve the 
use of information about the study population to construct 
efficient procedures. While design and estimation are 
intimately related, with estimators depending on the design, 
the two topics are often treated somewhat separately in the 
survey sampling literature. We follow tradition by first
studying estimation, treating the design as given. The
estimation task is to combine the available information
about the population with the sample data to produce good
representations of characteristics of interest.

Regression estimation is one of the important procedures
that use population information, or information from a larger
sample, to construct estimators with good efficiency. The
information, sometimes called auxiliary information, may 
have been used in the design or may not have been 
available at the design stage. In surveys of the human 
population, the information often comes from official 
sources such as the national census. Similar sources may 
provide information for other types of surveys. For 
example, in a survey of land use the total surface area, the 
area owned by the national government, and the area in 
permanent water bodies may be available from national data 
archives. 

Three distinct situations can be identified with respect to 
the nature of the auxiliary information that is available. In 
the first, the values of the auxiliary vector x are known for 
each element in the population at the time of sample 
selection. In this case the auxiliary variable can be used in 
designing the sample selection procedure. 

In the second situation all values of the vector x are 
known, but a particular value cannot be associated with a 
particular element until the sample is observed. In this case, 
the auxiliary information cannot be used in design, but a 
wide range of estimation options are available once the 
observations are available. For example, the population 
census may give the age-sex distribution of the population, 
but a list of individuals and their characteristics is not
available to nongovernmental institutions selecting
samples.

In the third situation, only the population mean of x is 
known, or known for a large sample. In this case, the 
auxiliary information cannot be used in design and the 
estimation options are limited. For example, the U.S.
Department of Agriculture might release an estimate of the
total number of animals of a particular type on farms on a
particular date. Our discussion concentrates on this
situation.

Two estimation situations can also be identified. In one, 
a single variable and a parameter, or a very small number of 
parameters, is under consideration. The analyst is willing to 
invest a great deal of effort in the analysis, has a well 
formulated population model, and is prepared to support the 
estimation procedure on the basis of the reasonableness of 
the model. In the second situation, a large number of 
analyses of a large number of variables is anticipated. No 
single model is judged adequate for all variables. The 
prototypical example of the second situation is the case in 
which a data set is prepared by the survey sampler to be 
analyzed by others. Because the person preparing the data 
set does not have knowledge of the analysis variables, 
emphasis is placed on the use of estimators that can be 
defended with minimal recourse to models. 

Regression estimators fall in the class of linear esti- 
mators. Linear estimators have a particular advantage in 
survey sampling because once the weights are calculated 
they are appropriate for any analysis variable. Several 
properties of estimators will be examined in our discussion. 
Given a model, we accept the classical goal of minimizing 
the mean square error in a class of estimators. That class 
may be the class of linear estimators that are unbiased under 
the model, but the class may be further restricted. 

Estimators that are scale and location invariant can be 
used in general settings. Mickey (1959) suggested that the 
term regression estimator be restricted to linear estimators 
that are location and scale invariant. While we may not 
adhere strictly to this definition, we support the distinction 
between estimators that are location and scale invariant and
those that are not. We consider location invariance to be 
important for sampling designs where the unit of interest for 
analysis is also the sampling unit. For cluster and two stage 
designs in which weights are constructed for primary 
sampling unit totals, location invariance is less important. 

Models play an important role in the construction of 
regression estimators. It is desirable that the estimators 
retain good properties if the model specification is not 
exact. Therefore properties conditional on the realized 
finite population, as well as properties under the model, are 
important. 

Linear estimators that reproduce the known means of the 
auxiliary variables are said to be calibrated. This is a desir- 
able property in that, for example, the marginals of tables 
with an auxiliary variable as an analysis variable agree with 
known totals. If the auxiliary variable is of no analytic 
interest, then calibration is less important. 


2. BACKGROUND 


The earliest references to the use of regression in survey 
sampling include Jessen (1942) and Cochran (1942). 
Regression in similar contexts would certainly have been 
used earlier and Cochran (1977, page 189) mentions a 
regression on leaf area by Watson (1937). It is interesting 
that Jessen’s use of regression was essentially composite 
estimation where regression was used to improve estimates 
for two time points given samples at each point with some 
common elements in the two samples. Cochran (1942) 
gave the basic theory for regression in survey sampling 
relying heavily on linear model theory. He showed that the 
linear model did not need to hold in order for the regression 
estimator to perform well. He derived an expression for the 
$O(n^{-1})$ bias and an $O(n^{-2})$ approximation for the variance.
He also showed that for the model with regression passing 
through the origin and error variances proportional to x, the 
ratio estimator is the generalized least squares estimator. 

Regression estimation attracted theoretical interest in the 
1950’s, often in the form of studies of the bias. See Mickey 
(1959). Brewer (1963) is an early reference that considers 
linear estimation using a superpopulation model to 
determine an optimal procedure. He was concerned with 
finding the optimal design for the ratio estimator and 
discussed the possible conflict between an optimal design 
under the model and a design that is less model dependent. 
See also Brewer (1979). Royall (1970) argued for the use of 
models, that the conditional properties that are important 
are those conditional on the auxiliary information in the 
sample, and that the design should be chosen to optimize 
those properties. Royall and his coworkers, e.g., Royall and 
Cumberland (1981), studied the conditional properties of 
regression estimators, conditional on the realized sample of 
auxiliary variables. 
A great deal of research was conducted in the 1970’s and
1980’s on the general nature of the regression estimator in 
survey samples and on the degree to which the model 
prediction approach can be reconciled with the design 
perspective. Fuller (1973, 1975) gave the large sample 
properties of a vector of regression coefficients computed 
from a survey sample. Isaki (1970) studied regression 
estimators and the results were published in expanded 
versions in Isaki and Fuller (1982) and Fuller and Isaki 
(1981). It was shown that a regression estimator constructed 
under a model is design consistent for the population mean 
if the model contains certain variables. Cassel, Särndal and
Wretman (1976) considered both model and design 
principles in estimator construction and suggested the term 
“generalized regression estimator” for design consistent 
estimators of the total of the form 


$$\hat T_{y,\mathrm{greg}} = \hat T_{y,HT} + (T_{x,N} - \hat T_{x,HT})\hat\beta,$$

where $\hat T_{y,HT}$ and $\hat T_{x,HT}$ are the Horvitz-Thompson
estimators of the totals of y and x, respectively, $T_{x,N}$ is the
known population total of x and $\hat\beta$ is an estimated regression
coefficient. Särndal (1980), Wright (1983), and Särndal
and Wright (1984) discussed classes of regression
estimators. The text by Särndal, Swensson and Wretman
(1992) contains an extensive discussion of regression
estimation and Mukhopadhyay (1993) is a review.
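
As a minimal numerical sketch of the generalized regression form above, assume a single auxiliary variable and a regression coefficient supplied by the caller. The function names `ht_total` and `greg_total`, the selection probabilities, and the data are hypothetical and serve only to illustrate the arithmetic.

```python
def ht_total(values, pi):
    """Horvitz-Thompson estimator of a total: sum of values / selection probability."""
    return sum(v / p for v, p in zip(values, pi))


def greg_total(y, x, pi, x_total, beta_hat):
    """HT total of y plus (known x total - HT total of x) times beta_hat."""
    return ht_total(y, pi) + (x_total - ht_total(x, pi)) * beta_hat


# Hypothetical sample of four elements with selection probabilities pi.
pi = [0.1, 0.2, 0.25, 0.5]
x = [3.0, 5.0, 4.0, 8.0]
y = [7.0, 11.0, 9.0, 15.0]
x_total = 95.0   # assumed known population total of x
beta_hat = 1.8   # assumed estimate of the regression coefficient

estimate = greg_total(y, x, pi, x_total, beta_hat)

# Using x itself as the study variable, with coefficient 1, returns the known total.
check = greg_total(x, x, pi, x_total, 1.0)
```

The last line illustrates why estimators of this form are attractive: when the study variable is the auxiliary variable, the correction term exactly removes the sampling error in the Horvitz-Thompson total of x.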

It was the 1970’s before the use of regression for general 
purpose, multiple characteristic, surveys appeared and it 
was the 1990’s before the use of regression weighting could 
be called widespread. An early use of regression weights 
was at Doane Agricultural Services Inc., now Doane 
Marketing Research. During 1971-1972 a readership study 
of farmers was conducted under the direction of Mr. John 
Wilkin in which 6,920 farmers responded. Weights for the 
respondents were constructed using regression procedures, 
where the controls came from the U.S. Agricultural Census 
and from Department of Agriculture sources. Doane 
provided financial support to Iowa State University to 
develop a regression weight generation program. To 
guarantee positive weights in the Doane study, observations 
with small weights were grouped and assigned a common 
weight. Grouping continued until the common weight was 
positive. Later computer programs used modifications of 
the Huang and Fuller (1978) procedure to guarantee 
positive weights. Doane has used regression weights for 
their syndicated market research studies since 1972. 

Regression estimation was first used at Statistics Canada 
in 1988 for the Canadian Labour Force Survey. In 1992 
regression estimation was used by the 1991 Canadian 
Census of Population to ensure that the weighted sum of 
variables collected via the long form (a one in five 
systematic sample of all households in Canada) was equal 
to known household and population totals as collected in 
the 1991 Census. See Bankier, Rathwell and Majkowski 
(1992) and Bankier, Houle and Luc (1997). The regression 
estimator is also the key component of the Generalized 
Estimation System (GES) developed at Statistics Canada
and used in numerous business and social surveys since its
release in 1992. The methodology is described in Estevao,
Hidiroglou and Särndal (1995). See also Hidiroglou,
Särndal and Binder (1995). Regression estimation is now
used to construct composite estimators for the Canadian 
Labour Force Survey. See Singh, Kennedy and Wu (2001), 
Gambino, Kennedy and Singh (2001) and Fuller and Rao 
(2001). 

Bethlehem and Keller (1987) report on the use of 
regression estimation at the Netherlands Central Bureau of 
Statistics (now Statistics Netherlands) in a program called 
LINWEIGHT. Nieuwenbroek, Renssen and Hofman
(2000) describe the software package Bascula, which has
replaced LINWEIGHT. Deville, Särndal and Sautory
(1993) describe a computer program CALMAR developed
at Institut National de la Statistique et des Études
Économiques (I. N. S. E. E.) that computes weights of the
regression type with options for different objective 
functions. A program developed at Statistics Sweden and 
called CLAN97 is documented in Anderson and Nordberg 
(1998). Folsom and Singh (2000) discuss a procedure 
developed at the Research Triangle Institute. 


3. THE CLASSICAL LINEAR MODEL 


The classical linear model is the foundation for survey 
regression estimation, but the survey situation requires 
certain adaptations. To introduce regression estimation for 
survey samples, we review the classical linear model. 
Assume 


$$y_i = x_i\beta + e_i, \qquad e_i \sim NI(0, \sigma_e^2), \qquad (3.1)$$

where $e_i$ is independent of the $k$-dimensional row vectors $x_j$
for all $i$ and $j$, and $\beta$ is the unknown parameter column
vector. We will also use matrix representations for the
sample. Thus, for a sample of $n$ elements,

$$X' = (x_1', x_2', \dots, x_n') \quad \text{and} \quad y' = (y_1, y_2, \dots, y_n).$$

Given a sample of size $n$ and treating the $x_i$ as fixed, the
best (minimum mean squared error) estimator of $\beta$ is

$$\hat\beta = \Big(\sum_{i\in A} x_i'x_i\Big)^{-1}\sum_{i\in A} x_i'y_i = (X'X)^{-1}X'y, \qquad (3.2)$$

where $A$ is the set of indexes of the sample elements and we
assume, as we will throughout, that the matrix to be
inverted is nonsingular. If the $e_i$ are not normally distributed,
$\hat\beta$ is the estimator with smallest variance in the class
of linear unbiased estimators. The estimator of a linear
combination of the coefficients, say $\theta_a = \sum_{j=1}^{k} a_j\beta_j$, can be
written as

$$\hat\theta_a = \sum_{i\in A} w_{ai}y_i,$$

where the weights $w_{ai}$ minimize the Lagrangean

$$\sum_{i\in A} w_{ai}^2 + \sum_{j=1}^{k}\lambda_j\Big(\sum_{i\in A} w_{ai}x_{ij} - a_j\Big)$$

and the $\lambda_j$ are Lagrange multipliers. The variance of $\hat\theta_a$ is

$$V\{\hat\theta_a\} = V\Big\{\sum_{i\in A} w_{ai}e_i\Big\} = \sum_{i\in A} w_{ai}^2\sigma_e^2,$$

because the weights are functions of the $x_i$ and not of $y_i$.
The covariance matrix of $\hat\beta$ is


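$$V\{\hat\beta\} = \Big(\sum_{i\in A} x_i'x_i\Big)^{-1} V\Big\{\sum_{i\in A} b_i'\Big\}\Big(\sum_{i\in A} x_i'x_i\Big)^{-1} = V\Big\{\sum_{i\in A} c_i\Big\}, \qquad (3.3)$$

where $b_i' = x_i'e_i$ and $c_i = (X'X)^{-1}x_i'e_i$. Because $e_i$ is
independent of $x_j$ for all $i$ and $j$,

$$V\Big\{\sum_{i\in A} b_i'\Big\} = \sum_{i\in A} V\{b_i'\} = \sum_{i\in A} x_i'x_i\,\sigma_e^2,$$

and we obtain the familiar expression,

$$V\{\hat\beta\} = \Big(\sum_{i\in A} x_i'x_i\Big)^{-1}\sigma_e^2.$$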


The usual unbiased estimator of the covariance matrix of $\hat\beta$
is obtained by replacing $\sigma_e^2$ with the unbiased estimator of $\sigma_e^2$
obtained as the mean square of the residuals, $\hat e_i = y_i - x_i\hat\beta$.
An estimator of the covariance matrix that estimates the
$V\{\sum_{i\in A} b_i'\}$ directly is

$$\hat V\{\hat\beta\} = \Big(\sum_{i\in A} x_i'x_i\Big)^{-1}\Big(\sum_{i\in A} \hat b_i'\hat b_i\Big)\Big(\sum_{i\in A} x_i'x_i\Big)^{-1} = \sum_{i\in A} \hat c_i\hat c_i', \qquad (3.4)$$

where $\hat b_i' = x_i'\hat e_i$ and $\hat c_i = (X'X)^{-1}x_i'\hat e_i$. In the same way

$$\hat V\{\hat\theta_a\} = \sum_{i\in A} w_{ai}^2\hat e_i^2 \qquad (3.5)$$

is a linear combination of the elements of (3.4) and is a
consistent estimator of $V\{\hat\theta_a\}$. The estimator (3.4) is a
consistent estimator of $V\{\hat\beta\}$ when the covariance matrix
of the $e_i$ is a diagonal matrix with bounded elements. Thus
it is a more robust estimator. However, the estimator (3.4)
is biased downward because the variance of $\hat e_i$ is usually
less than the variance of $e_i$. Two methods are available for
reducing the bias. The first is to make a degrees-of-freedom
adjustment by multiplying the estimator (3.4) by $(n-k)^{-1}n$, where $k$
is the dimension of $x_i$. An alternative adjustment is to
replace $\hat e_i$ with

$$(1 - \psi_{ii})^{-1/2}\,\hat e_i,$$

where $\psi_{ii}$ is the $i$-th diagonal element of $X(X'X)^{-1}X'$. See
Horn, Horn and Duncan (1975), Royall and Cumberland 
(1978) and Cook and Weisberg (1982, section 2.2). 
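
The robust variance estimator and its two bias adjustments can be illustrated with a minimal sketch. It assumes the simplest case $x_i = (1, x_{1,i})$, so that $k = 2$ and the slope element of $(X'X)^{-1}x_i'$ has a closed form. The function name `robust_slope_variance` and the data are hypothetical.

```python
def robust_slope_variance(x, y):
    """Robust variance of the OLS slope for y = b0 + b1*x + e, with two adjustments."""
    n, k = len(x), 2                      # k is the dimension of x_i = (1, x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    # Slope element of (X'X)^{-1} x_i' is (x_i - xbar) / sxx.
    g = [(xi - xbar) / sxx for xi in x]
    # i-th diagonal element of X (X'X)^{-1} X'.
    lev = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    # Unadjusted estimator: sum of squared c_i (slope component).
    raw = sum((gi * ei) ** 2 for gi, ei in zip(g, resid))
    # Degrees-of-freedom adjustment: multiply by n / (n - k).
    df_adjusted = raw * n / (n - k)
    # Leverage adjustment: replace each residual by (1 - lev_i)^(-1/2) * residual.
    lev_adjusted = sum((gi * ei) ** 2 / (1 - hi) for gi, ei, hi in zip(g, resid, lev))
    return {"b0": b0, "b1": b1, "leverages": lev, "raw": raw,
            "df_adjusted": df_adjusted, "leverage_adjusted": lev_adjusted}


# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]
y = [2.1, 2.9, 4.2, 4.8, 7.5, 9.6]
out = robust_slope_variance(x, y)
```

Both adjustments inflate the unadjusted estimate, which is the intended correction for the downward bias. The leverage adjustment inflates most the contributions of elements whose $x$ values lie far from the sample mean.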

If we observe the value $x_i$ for an element, but do not
observe $y_i$, then the best predictor of $y_i$ for that element is
$\hat y_i = x_i\hat\beta$. Likewise, if we know the sum of $x_i$ for a set of
x’s, then the best predictor for the sum of the $y_i$ is the sum
of $x_i\hat\beta$. Thus, given a set of $N$ elements that satisfy model
(3.1), a set of observations $(y_i, x_i)$ on a subset denoted by
$A$, and the known values of $x_i$ for the remaining $N-n$
elements,

$$\hat T_{\bar A} = \sum_{i\in\bar A}\hat y_i = \sum_{i\in\bar A} x_i\hat\beta,$$

where $\bar A$ is the set of elements for which y is not observed,
is the best predictor of the sum of the unobserved y’s. See
Goldberger (1962), Brewer (1963), Royall (1970), Harville
(1976) and Graybill (1976, section 12.2). Hence

$$\sum_{i\in A} y_i + \hat T_{\bar A} = \hat T_{y,\mathrm{reg}} \qquad (3.6)$$

is the best predictor for the total of N observations.

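If the first element in the x-vector is always one, we can
partition the x-vector as $x_i = (1, x_{1,i})$ and write the
regression estimator of the mean as

$$\bar y_{\mathrm{reg}} = N^{-1}\hat T_{y,\mathrm{reg}} = \bar x_N\hat\beta = \bar y_n + (\bar x_{1,N} - \bar x_{1,n})\hat\beta_1, \qquad (3.7)$$

where $\hat\beta$ of (3.2) is partitioned as $(\hat\beta_0, \hat\beta_1')'$ and $(\bar y_n, \bar x_{1,n})$ is
the vector of simple sample means. We call $\bar x_N\hat\beta$ the
regression estimator of the mean.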
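
The equality of the prediction form and the mean-adjustment form of the regression estimator can be checked numerically. The sketch below assumes one auxiliary variable plus an intercept and ordinary least squares on the sample; the function name `regression_mean`, the population values, and the sample are hypothetical.

```python
def regression_mean(x_pop, sample, y_sample):
    """Regression estimator of the mean for x_i = (1, x_{1,i}) under OLS.

    x_pop holds the auxiliary value of every population element, sample holds
    the indices of the sampled elements, and y_sample holds their observed y.
    Returns the estimator computed in two algebraically equivalent ways.
    """
    N, n = len(x_pop), len(sample)
    x_s = [x_pop[i] for i in sample]
    xbar_n, ybar_n = sum(x_s) / n, sum(y_sample) / n
    xbar_N = sum(x_pop) / N
    sxx = sum((xi - xbar_n) ** 2 for xi in x_s)
    b1 = sum((xi - xbar_n) * (yi - ybar_n) for xi, yi in zip(x_s, y_sample)) / sxx
    b0 = ybar_n - b1 * xbar_n
    # Sample mean plus a regression adjustment for the gap in x means.
    adjusted = ybar_n + (xbar_N - xbar_n) * b1
    # Observed y plus predictions for the unobserved elements, divided by N.
    sampled = set(sample)
    predicted = sum(b0 + b1 * x_pop[i] for i in range(N) if i not in sampled)
    prediction_form = (sum(y_sample) + predicted) / N
    return adjusted, prediction_form


# Hypothetical population of eight elements, four of them sampled.
x_pop = [2.0, 3.0, 5.0, 6.0, 7.0, 9.0, 10.0, 12.0]
sample = [0, 2, 5, 7]
y_sample = [4.1, 9.8, 18.5, 23.9]
adjusted, prediction_form = regression_mean(x_pop, sample, y_sample)
```

The two forms agree because the least squares residuals sum to zero when the model contains an intercept, so the observed y values can be replaced by their fitted values without changing the total.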

Given the model (3.1), the expected value of the mean of
y for the finite population of N elements generated by the
model is $\bar x_N\beta$, and $\bar x_N\hat\beta$ is an unbiased estimator of the
finite population mean. This, we believe, is the point at
which regression estimation for the finite population mean 
under more complex designs begins. 


4. DESIGN BASED ESTIMATION 


The development of this section treats the finite 
population as a sample realization from an infinite popu- 
lation. The use of such models has a long history in survey 
sampling. Some references through 1970 are Cochran 
(1939, 1942, 1946), Deming and Stephan (1941), Madow 
and Madow (1944), Yates (1949), Godambe (1955), Hajek 
(1959), Rao, Hartley, and Cochran (1962), Konijn (1962), 
Brewer (1963), Godambe and Joshi (1965), Hanurav 
(1966), Ericson (1969), Isaki (1970), and Royall (1970). 

To discuss the large sample properties of regression 
estimators we consider sequences of finite populations and 
associated probability samples. The set of indices of the 
elements in the $N$th finite population is $U_N = \{1, 2, \dots, N\}$,
where $N = 1, 2, \dots$. Associated with the $i$th element of the
$N$th population is a row vector of characteristics
$z_{iN} = (y_{iN}, x_{iN})$. Let

$$F_N = (z_{1N}, z_{2N}, \dots, z_{NN})$$

be the set of vectors for the $N$th finite population. The
subscript $N$ on the vectors will often be omitted. The finite
population mean is

$$\bar z_N = (\bar y_N, \bar x_N) = N^{-1}\sum_{i=1}^{N} (y_i, x_i). \qquad (4.1)$$

We denote the set of indices appearing in the sample
selected from the $N$th finite population by $A_N$.

When the finite population is a sample from an infinite 
superpopulation, the probability properties of a sample are 
determined by the properties of the superpopulation and the 
properties of the probability mechanism used to select the 
sample. One can consider the unconditional properties, the 
properties conditional on the particular finite population, or 
the properties conditional on some part of the realized 
sample. 

Properties conditional on the finite population depend 
primarily on the survey design and are often called design 
properties. Thus an estimator $\hat\theta$ is said to be design consistent
for the finite population parameter $\theta_N$ if, for all $\epsilon > 0$,

$$\lim_{N\to\infty} \operatorname{Prob}\{|\hat\theta - \theta_N| > \epsilon \mid F_N\} = 0,$$

where the notation means that we condition on the realized
finite population $F_N$ and, hence, the probability is with
respect to the design.

Assume the finite population is generated as independent
selections from a superpopulation for which $E\{z_i'z_i\}$ is
positive definite, where $z_i = (y_i, x_i)$. We define a superpopulation
vector of least squares regression coefficients by

$$\beta = [E\{x_i'x_i\}]^{-1}E\{x_i'y_i\}. \qquad (4.2)$$


Given a sample of $n$ observations on $z_i$ we define the
$n \times (k+1)$ matrix $Z = (y, X)$ of observations, where the $i$th
row of $Z$ is $(y_i, x_i)$. If we assume the model

$$y = X\beta + u, \qquad E\{(u, uu')\} = (0, \Phi), \qquad (4.3)$$

the generalized least squares estimator of $\beta$ is

$$\hat\beta = (X'\Phi^{-1}X)^{-1}X'\Phi^{-1}y. \qquad (4.4)$$

The model (4.3) serves as motivation for estimators of the
form (4.4) but we shall consider estimators where $\Phi$ is a
general symmetric positive definite weight matrix, not
necessarily the covariance matrix of the errors.

We give the large sample properties of the vector of 
estimated regression coefficients (4.4) following Fuller 
(1975). See also Hidiroglou (1974), Scott and Wu (1981), 
and Robinson and Särndal (1983).

Assume the superpopulation has eighth moments and
that the sample design is such that the error in the Horvitz-Thompson
estimator of the mean is $O_p(n^{-1/2})$, where the
Horvitz-Thompson estimator of the mean is

$$\bar z_{HT} = (\bar y_{HT}, \bar x_{HT}) = N^{-1}\sum_{i\in A}\pi_i^{-1}z_i \qquad (4.5)$$

and $\pi_i$ is the selection probability for element $i$. Then the
error in the vector of regression coefficients is

$$\hat\beta - \beta_N \mid F_N = Q_{xx,N}^{-1}\,\bar b_{HT} + O_p(n^{-1}), \qquad (4.6)$$

where

$$\beta_N = Q_{xx,N}^{-1}Q_{xy,N}, \qquad (4.7)$$

$$(Q_{xx,N}, Q_{xy,N}) = E\{(\hat Q_{xx}, \hat Q_{xy}) \mid F_N\}, \qquad (4.8)$$

$$(\hat Q_{xx}, \hat Q_{xy}) = n^{-1}(X'\Phi^{-1}X,\; X'\Phi^{-1}y),$$

$$\bar b_{HT} = N^{-1}\sum_{i\in A}\pi_i^{-1}b_i', \qquad (4.9)$$

$b_i' = n^{-1}N\pi_i\zeta_ie_i$, $e_i = y_i - x_i\beta_N$, and $\zeta_i$ is column $i$ of
$X'\Phi^{-1}$. By (4.9) the error in the estimator of $\beta_N$ is
approximately the error in a Horvitz-Thompson estimator
of the mean. In result (4.6), the $\beta_N$ is defined as a function
of the expected values of the sample quantities $(\hat Q_{xx}, \hat Q_{xy})$.
Thus $\beta_N$ is not necessarily the ordinary least squares finite
population regression coefficient. The vector $b_i$ of (4.9) is
the generalization of the vector $b_i$ of (3.3). If the limiting
distribution of the properly standardized Horvitz-Thompson 
estimator is normal, and if there is a design consistent esti- 
mator of the variance of the Horvitz-Thompson estimator, 
then it is possible to construct tests and confidence intervals 
for the coefficients. Assume the design is such that 


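$$[\hat V\{\bar z_{HT}\}]^{-1/2}(\bar z_{HT} - \bar z_N) \mid F_N \xrightarrow{L} N(0, I), \qquad (4.10)$$

as $N, n \to \infty$, where $V\{\bar z_{HT}\}$ is the covariance matrix of
$\bar z_{HT} - \bar z_N$, $V\{\bar z_{HT}\} = O(n^{-1})$, and the estimator $\hat V\{\bar z_{HT}\}$ is
consistent for $V\{\bar z_{HT}\}$. Then

$$[\hat V\{\hat\beta\}]^{-1/2}(\hat\beta - \beta_N) \mid F_N \xrightarrow{L} N(0, I), \qquad (4.11)$$

where

$$\hat V\{\hat\beta\} = \hat Q_{xx}^{-1}\hat V_{bb}\hat Q_{xx}^{-1} = \hat V\{\bar c_{HT}\}, \qquad (4.12)$$

$\hat V_{bb} = \hat V\{\bar b_{HT}\}$ is the estimated design variance of $\bar b_{HT}$
calculated with $\hat b_i' = n^{-1}N\pi_i\zeta_i\hat e_i$, $\hat e_i = y_i - x_i\hat\beta$, and
$\hat V\{\bar c_{HT}\}$ is the estimated design variance of $\bar c_{HT}$ calculated
with $\hat c_i' = \hat Q_{xx}^{-1}\hat b_i'$. The limiting properties hold for stratified
samples and for stratified two stage samples under mild
restrictions on the sequence of populations.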

By analogy to (3.7), a regression estimator of the finite
population mean is obtained by evaluating the estimated
regression function at the population mean of $\mathbf{x}$ to obtain

$$\bar{y}_{reg} = \bar{\mathbf{x}}_N \hat{\beta}, \tag{4.13}$$

where $\hat{\beta}$ is of the form (4.4) with a general $\Phi$ matrix. The
estimator can be written as $\mathbf{w}'\mathbf{y}$, where the vector of
weights can be constructed by minimizing the Lagrangean

$$\mathbf{w}'\Phi\mathbf{w} + (\mathbf{w}'\mathbf{X} - \bar{\mathbf{x}}_N)\lambda$$

and $\lambda$ is the vector of Lagrange multipliers.
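The weight construction can be checked numerically. The following is only an illustrative sketch, not code from the paper: the data values, the helper names `solve`, `cross_products` and `regression_weights`, and the choice of a diagonal $\Phi$ are assumptions made for the example. For diagonal $\Phi$ with entries $\phi_i$, the minimizer of the Lagrangean is $\mathbf{w} = \Phi^{-1}\mathbf{X}(\mathbf{X}'\Phi^{-1}\mathbf{X})^{-1}\bar{\mathbf{x}}_N'$; the sketch verifies that $\mathbf{w}'\mathbf{X} = \bar{\mathbf{x}}_N$ and that $\mathbf{w}'\mathbf{y} = \bar{\mathbf{x}}_N\hat{\beta}$.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def cross_products(X, phi, v=None):
    """Return X' Phi^{-1} X, or X' Phi^{-1} v when v is given (diagonal Phi)."""
    k = len(X[0])
    if v is None:
        return [[sum(r[a] * r[b] / p for r, p in zip(X, phi)) for b in range(k)]
                for a in range(k)]
    return [sum(r[a] * vi / p for r, vi, p in zip(X, v, phi)) for a in range(k)]


def regression_weights(X, phi, xbar_N):
    """Weights w = Phi^{-1} X (X' Phi^{-1} X)^{-1} xbar_N' for diagonal Phi."""
    lam = solve(cross_products(X, phi), xbar_N)
    return [sum(xa * la for xa, la in zip(r, lam)) / p for r, p in zip(X, phi)]


# Hypothetical sample of four elements: x_i = (1, x_{1,i}).
X = [[1.0, 2.0], [1.0, 3.0], [1.0, 5.0], [1.0, 7.0]]
y = [3.1, 4.2, 6.8, 9.0]
pi = [0.1, 0.2, 0.2, 0.4]   # assumed selection probabilities
phi = pi[:]                 # Phi = D_pi
xbar_N = [1.0, 4.5]         # assumed known population mean of x

w = regression_weights(X, phi, xbar_N)
beta_hat = solve(cross_products(X, phi), cross_products(X, phi, y))
ybar_from_weights = sum(wi * yi for wi, yi in zip(w, y))
ybar_from_beta = sum(xa * ba for xa, ba in zip(xbar_N, beta_hat))
calibration_check = [sum(wi * r[a] for wi, r in zip(w, X)) for a in range(2)]
```

Because the first column of `X` is a column of ones, the first entry of `calibration_check` also shows that the weights sum to one.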
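If there is a column vector $\gamma$ such that

$$\mathbf{X}\gamma = \Phi \mathbf{D}_{\pi}^{-1} \mathbf{J} \tag{4.14}$$

for all possible samples, where $\mathbf{D}_{\pi} = \mathrm{diag}(\pi_1, \pi_2, \ldots, \pi_n)$ and
$\mathbf{J}$ is an $n$-dimensional column vector of ones, then the
regression estimator $\bar{\mathbf{x}}_N \hat{\beta}$ of (4.13) with $\hat{\beta}$ defined in (4.4)
is a design consistent estimator of $\bar{y}_N$. It follows from
(4.11) that

$$[\bar{\mathbf{x}}_N \hat{V}\{\hat{\beta}\} \bar{\mathbf{x}}_N']^{-1/2} (\bar{\mathbf{x}}_N \hat{\beta} - \bar{y}_N) \overset{L}{\longrightarrow} N(0, 1).$$
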
The requirement of (4.14) that $\Phi \mathbf{D}_{\pi}^{-1} \mathbf{J}$ be in the column
space of $\mathbf{X}$ is crucial for design consistency. Simple ways to
satisfy this requirement are to let one column of $\mathbf{X}$ be the
column of ones and to use a multiple of $\mathbf{D}_{\pi}$ as $\Phi$, or to let
one column of $\mathbf{X}$ be the elements $\pi_i^{-1}$ and set $\Phi = \mathbf{I}$, or to let
one column of $\mathbf{X}$ be the elements $\pi_i$ and set $\Phi = \mathbf{D}_{\pi}^2$. If $\mathbf{X}$ is
composed of the single column vector with elements $\pi_i$ and
if $\Phi = \mathbf{D}_{\pi}^2$, then the estimator (4.13) reduces to the Horvitz-
Thompson estimator of (4.5) for fixed size designs. If
$\mathbf{X} = \mathbf{J}$ and $\Phi = \mathbf{D}_{\pi}$, the estimator (4.13) reduces to the ratio
estimator,

$$\bar{y}_{\pi} = \Big( \sum_{i \in A} \pi_i^{-1} \Big)^{-1} \sum_{i \in A} \pi_i^{-1} y_i, \tag{4.15}$$

cay ee 


teA 


(4.16) 
icA 

which is location and scale invariant. 
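The two reductions just described are easy to confirm for a single-column $\mathbf{X}$. The sketch below is an illustration under assumed data, not the paper's code; the function name `one_column_regression_estimator` and all numbers are hypothetical. With one column the estimator (4.13) is $\bar{x}_N (\sum x_i^2 \phi_i^{-1})^{-1} \sum x_i \phi_i^{-1} y_i$, and for a fixed size design the population mean of the $\pi_i$ is $n/N$.

```python
def one_column_regression_estimator(x, phi, y, xbar_N):
    """Estimator (4.13) when X has a single column x and Phi = diag(phi)."""
    num = sum(xi * yi / p for xi, yi, p in zip(x, y, phi))
    den = sum(xi * xi / p for xi, p in zip(x, phi))
    return xbar_N * num / den


N = 20                         # assumed population size
y = [3.0, 5.0, 4.0, 8.0]       # hypothetical sample values
pi = [0.1, 0.2, 0.3, 0.2]      # assumed selection probabilities
n = len(y)

# Case 1: x_i = pi_i, Phi = D_pi^2, population mean of pi_i equal to n / N.
reg_ht = one_column_regression_estimator(pi, [p * p for p in pi], y, n / N)
horvitz_thompson = sum(yi / p for yi, p in zip(y, pi)) / N

# Case 2: X = J (column of ones), Phi = D_pi, population mean of x equal to 1.
reg_ratio = one_column_regression_estimator([1.0] * n, pi, y, 1.0)
ratio_mean = sum(yi / p for yi, p in zip(y, pi)) / sum(1.0 / p for p in pi)
```

In this example `reg_ht` agrees with `horvitz_thompson` and `reg_ratio` agrees with `ratio_mean` up to rounding.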
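To see the nature of the estimator when (4.14) is
satisfied, let, with no loss of generality, $\mathbf{X} = (\mathbf{X}_0, \mathbf{X}_1)$, where
$\mathbf{X}_0 = \Phi \mathbf{D}_{\pi}^{-1} \mathbf{J}$ and $\mathbf{x}_i = (x_{0,i}, \mathbf{x}_{1,i})$. Then

$$\bar{y}_{reg} = \bar{x}_{0,N} \bar{x}_{0,\pi}^{-1} \bar{y}_{\pi} + (\bar{\mathbf{x}}_{1,N} - \bar{x}_{0,N} \bar{x}_{0,\pi}^{-1} \bar{\mathbf{x}}_{1,\pi}) \hat{\beta}_1, \tag{4.17}$$

where

$$\hat{\beta}_1 = [(\mathbf{X}_1 - \mathbf{X}_0 \hat{\mathbf{r}}_{1,0})' \Phi^{-1} (\mathbf{X}_1 - \mathbf{X}_0 \hat{\mathbf{r}}_{1,0})]^{-1} (\mathbf{X}_1 - \mathbf{X}_0 \hat{\mathbf{r}}_{1,0})' \Phi^{-1} \mathbf{y},$$

$\hat{\mathbf{r}}_{1,0} = \bar{x}_{0,\pi}^{-1} \bar{\mathbf{x}}_{1,\pi}$, and $(\bar{y}_{\pi}, \bar{\mathbf{x}}_{\pi})$ is defined in (4.16). The
ratios, such as $\bar{x}_{0,\pi}^{-1} \bar{y}_{\pi}$, can also be written as ratios of
Horvitz-Thompson estimators. If $\mathbf{J}$ is in the column space
of $\mathbf{X}$, estimator (4.17) is location invariant. If $\Phi = \mathbf{D}_{\pi}$, then
$\bar{x}_{0,N} = \bar{x}_{0,\pi} = 1$, and

$$\bar{y}_{reg} = \bar{\mathbf{x}}_N \hat{\beta} = \bar{y}_{\pi} + (\bar{\mathbf{x}}_{1,N} - \bar{\mathbf{x}}_{1,\pi}) \hat{\beta}_1, \tag{4.18}$$

$$\hat{\beta}_1 = \Big[ \sum_{i \in A} (\mathbf{x}_{1,i} - \bar{\mathbf{x}}_{1,\pi})' \pi_i^{-1} (\mathbf{x}_{1,i} - \bar{\mathbf{x}}_{1,\pi}) \Big]^{-1} \sum_{i \in A} (\mathbf{x}_{1,i} - \bar{\mathbf{x}}_{1,\pi})' \pi_i^{-1} (y_i - \bar{y}_{\pi}). \tag{4.19}$$

Also, when $\Phi = \mathbf{D}_{\pi}$, the $\beta_N$ of (4.7) is the population
regression coefficient

$$\beta_N = \Big( \sum_{i \in U} \mathbf{x}_i' \mathbf{x}_i \Big)^{-1} \sum_{i \in U} \mathbf{x}_i' y_i. \tag{4.20}$$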
Because the regression estimator of the mean is a linear
combination of regression coefficients, it is a regression
coefficient for a linear combination of the original x-
variables. To see this, let $\mathbf{x}_i = (x_{0,i}, \mathbf{x}_{1,i}) = (1, \mathbf{x}_{1,i})$, and
define a new vector with one in the first position and a
second vector with population mean equal to zero obtained
by subtracting the original population mean $\bar{\mathbf{x}}_{1,N}$ from the
original $\mathbf{x}_{1,i}$ vector. Let $\mathbf{q}_i = (1, \mathbf{x}_{1,i} - \bar{\mathbf{x}}_{1,N})$ be the
transformed vector. Then the population regression model is

$$y_i = \mathbf{q}_i \psi_N + e_i, \tag{4.21}$$

where the finite population coefficient vector is

$$\psi_N = (\bar{y}_N, \beta_{1,N}')' = \Big( \sum_{i \in U} \mathbf{q}_i' \mathbf{q}_i \Big)^{-1} \sum_{i \in U} \mathbf{q}_i' y_i. \tag{4.22}$$

The expression for the regression estimator of the mean
becomes

$$\bar{y}_{reg} = \bar{\mathbf{q}}_N \hat{\psi} = \hat{\psi}_0, \tag{4.23}$$

where $\hat{\psi}$ is obtained from (4.4) with $\mathbf{q}_i$ replacing $\mathbf{x}_i$.
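Because the estimator is a linear estimator of the form
$\mathbf{w}'\mathbf{y}$, we can write

$$\bar{y}_{reg} = \sum_{i \in A} w_i y_i = \sum_{i \in A} \pi_i^{-1} g_i y_i, \tag{4.24}$$

where $w_i = \pi_i^{-1} g_i$. Furthermore, the estimated variance
from (4.12) is

$$\hat{V}(\bar{y}_{reg}) = \hat{V}\{\hat{\psi}_0\} = \hat{V}\Big\{ \sum_{i \in A} \pi_i^{-1} g_i \hat{e}_i \Big\}, \tag{4.25}$$

where it is understood that the estimated design variance of
(4.25) is computed for the variable $g_i \hat{e}_i$, $\hat{e}_i = y_i - \mathbf{x}_i \hat{\beta}$, and $\hat{\beta}$
is defined in (4.4). The variance estimator (4.25) is a direct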
generalization of expression (3.5). By transforming the 
variables so that the population mean of the auxiliary vector 
is zero, the first element of the regression vector is the 
regression estimator of the mean and the first element of 
(4.12) is an estimator of the variance of the regression 
estimator that contains a component due to estimating . 
This was pointed out in Hidiroglou, Fuller, and Hickman
(1978). Also, see Särndal (1982). Särndal, Swensson and
Wretman (1989) suggested the g-factor terminology for the
calculation of the estimated variance of a regression
estimated total.

From (4.17), we can write
ss = = Lape = = fee 
Yreg  *0,n *0,n ba Se Biy Oe Xin B,y)| 
ti Os Utiane) 
= 2% Oy Gig), 


where e, = y, — x,B. Hence, the variance of the regression 
estimator can be estimated with 


Pe} = V(t)" 


(4.26) 


Map 1; é; } 
icA icA 
where $\hat{e}_i = y_i - \mathbf{x}_i \hat{\beta}$. Because (4.25) is as easy to compute
as (4.26), and is applicable when $\bar{\mathbf{x}}_{1,\pi} - \bar{\mathbf{x}}_{1,N}$ is not
$O_p(n^{-1/2})$, the estimator (4.25) is preferred.

The variance of the regression estimator can also be 
computed using the jackknife or other replication methods, 
and the use of replication methods is becoming more 
common. See Frankel (1971), Kish and Frankel (1974), 
Woodruff and Causey (1976), Royall and Cumberland 
(1978), and Duchesne (2000). Yung and Rao (1996) 
showed that (4.25) is identical to a jackknife linearization 
estimator for stratified multistage designs. 
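As an illustration of the replication idea, the following sketch computes a delete-one jackknife variance for the regression estimator of the mean. It is a simplified example under stated assumptions, not a method taken from the references above: it assumes simple random sampling, a single auxiliary variable, a negligible sampling fraction (no finite population correction), and hypothetical data and function names.

```python
def regression_estimator(x, y, xbar_N):
    """ybar + (xbar_N - xbar) * b for one auxiliary variable under equal weights."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return ybar + (xbar_N - xbar) * sxy / sxx


def jackknife_variance(x, y, xbar_N):
    """Delete-one jackknife variance estimate of the regression estimator."""
    n = len(x)
    # Recompute the full estimator, slope included, with element i removed.
    reps = [regression_estimator(x[:i] + x[i + 1:], y[:i] + y[i + 1:], xbar_N)
            for i in range(n)]
    center = sum(reps) / n
    return (n - 1) / n * sum((r - center) ** 2 for r in reps)


x = [1.0, 2.0, 4.0, 7.0, 9.0]
xbar_N = 5.0                                   # assumed known population mean
y_noisy = [5.3, 7.9, 14.6, 22.4, 29.5]         # hypothetical observations
y_linear = [2.0 + 3.0 * xi for xi in x]        # exact linear relationship

var_noisy = jackknife_variance(x, y_noisy, xbar_N)
var_linear = jackknife_variance(x, y_linear, xbar_N)
```

When `y` is an exact linear function of `x`, every replicate reproduces the same estimate, so `var_linear` is zero up to rounding. Recomputing the slope in each replicate is what lets the jackknife reflect the effect of estimating the regression coefficient.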

The approach to regression estimation associated with
(4.18) and (4.19) falls completely within a design
formulation. No models of the population, beyond the existence
of moments, are used, though one might argue that one
would only consider regression when one feels there is
some linear correlation between $\mathbf{x}_{1,i}$ and $y_i$.

The estimator (4.19) is a very natural estimator because
the estimated regression coefficient is a design consistent
estimator of the population regression coefficient. It is
mildly annoying that (4.18) does not always yield the
smallest large sample design variance for the estimated
mean. Treating $\hat{\beta}_1$ of (4.18) as a fixed vector, the value that
minimizes the variance of the linear combination of means
is

$$\beta_{1,dopt} = [V\{\bar{\mathbf{x}}_{1,\pi} \mid F_N\}]^{-1} C\{\bar{\mathbf{x}}_{1,\pi}, \bar{y}_{\pi} \mid F_N\}. \tag{4.27}$$

See Cochran (1977, page 201), Fuller and Isaki (1981),
Montanari (1987, 1999) and Rao (1994). If there is a design
consistent estimator of the variance of $\bar{\mathbf{x}}_{1,\pi}$, then the $\beta_1$
that minimizes the estimated variance

$$\hat{V}\{\bar{y}_{\pi} - \bar{\mathbf{x}}_{1,\pi} \beta_1\}, \tag{4.28}$$

denoted by $\hat{\beta}_{1,dopt}$, is a consistent estimator of $\beta_{1,dopt}$. It
follows that the estimator

$$\bar{y}_{dopt} = \bar{y}_{\pi} + (\bar{\mathbf{x}}_{1,N} - \bar{\mathbf{x}}_{1,\pi}) \hat{\beta}_{1,dopt} \tag{4.29}$$

has the minimum limit variance for design consistent
estimators of the form $\bar{y}_{\pi} + (\bar{\mathbf{x}}_{1,N} - \bar{\mathbf{x}}_{1,\pi}) \hat{\beta}_1$. Also

$$[\hat{V}\{\bar{e}_{\pi}\}]^{-1/2} (\bar{y}_{dopt} - \bar{y}_N) \overset{L}{\longrightarrow} N(0, 1), \tag{4.30}$$

where $\hat{V}\{\bar{e}_{\pi}\}$ is the estimator of (4.26) constructed with
$\hat{e}_i = y_i - \bar{y}_{\pi} - (\mathbf{x}_{1,i} - \bar{\mathbf{x}}_{1,\pi}) \hat{\beta}_{1,dopt}$.

In a large sample sense, (4.29) answers the question of 
how to construct a regression estimator with optimum 
design properties. In practice a number of questions remain. 
The estimator is obtained under the assumption of a large 
sample and a vector x of fixed dimension. In practice there 
may be a number of potential auxiliary variables and if a 
large number are included in the regression, terms excluded 
in the large sample approximation become important. This 
is particularly true for cluster samples where the number of 
primary sampling units in the sample is small. In such cases, 
the number of degrees-of-freedom in $\hat{V}\{\bar{\mathbf{x}}_{1,\pi}\}$ is small and
the inverse can be unstable. These issues are discussed 
further in section 9. 

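The estimator $\hat{\beta}_{1,dopt}$ of (4.29) is linear in $y$ for most
designs. See Rao (1994). For example, for a stratified
design with simple random sampling within strata,

$$\hat{\beta}_{1,dopt} = \Big[ \sum_{h=1}^{H} \sum_{j=1}^{n_h} K_h (\mathbf{x}_{1,hj} - \bar{\mathbf{x}}_{1,h})' (\mathbf{x}_{1,hj} - \bar{\mathbf{x}}_{1,h}) \Big]^{-1} \sum_{h=1}^{H} \sum_{j=1}^{n_h} K_h (\mathbf{x}_{1,hj} - \bar{\mathbf{x}}_{1,h})' (y_{hj} - \bar{y}_h), \tag{4.31}$$

where $K_h = W_h^2 (1 - f_h) n_h^{-1} (n_h - 1)^{-1}$,
$(\bar{\mathbf{x}}_{1,h}, \bar{y}_h) = n_h^{-1} \sum_{j=1}^{n_h} (\mathbf{x}_{1,hj}, y_{hj})$,
$W_h = N^{-1} N_h$, $N_h$ is the size of stratum $h$, $f_h = N_h^{-1} n_h$,
and $n_h$ is the sample size in stratum $h$. It follows that the
weights associated with estimator (4.29) are

$$w_{hj} = W_h n_h^{-1} + (\bar{\mathbf{x}}_{1,N} - \bar{\mathbf{x}}_{1,\pi}) \Big[ \sum_{t=1}^{H} \sum_{i=1}^{n_t} K_t (\mathbf{x}_{1,ti} - \bar{\mathbf{x}}_{1,t})' (\mathbf{x}_{1,ti} - \bar{\mathbf{x}}_{1,t}) \Big]^{-1} K_h (\mathbf{x}_{1,hj} - \bar{\mathbf{x}}_{1,h})'. \tag{4.32}$$

See also Särndal (1996). The weights of (4.32) can be
constructed by minimizing $\sum_{h=1}^{H} \sum_{i \in A_h} w_{hi}^2 K_h^{-1}$ subject to the
constraints

$$\sum_{i \in A_h} w_{hi} = N^{-1} N_h, \quad h = 1, 2, \ldots, H,$$

and

$$\sum_{h=1}^{H} \sum_{i \in A_h} w_{hi} \mathbf{x}_{1,hi} = \bar{\mathbf{x}}_{1,N},$$

where $A_h$ is the set of sample elements in stratum $h$.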
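A small numerical sketch may help make the stratified weights concrete. It is an illustration only: the two strata, their sizes, the data, and the assumed population mean of $x$ are hypothetical, there is a single auxiliary variable, and $K_h$ is taken to be $W_h^2 (1 - f_h) n_h^{-1} (n_h - 1)^{-1}$, the factor that multiplies the within-stratum sum of squares in the usual stratified variance estimator. The sketch checks that the weights reproduce the stratum fractions, reproduce the assumed population mean of $x$, and give the same estimate as $\bar{y}_{st} + (\bar{x}_N - \bar{x}_{st}) \hat{\beta}$.

```python
def mean(v):
    return sum(v) / len(v)


# Hypothetical stratified simple random sample with one auxiliary variable.
strata = [
    {"N_h": 60, "x": [2.0, 3.0, 5.0], "y": [4.1, 6.2, 9.8]},
    {"N_h": 40, "x": [6.0, 8.0, 9.0, 11.0], "y": [12.5, 15.0, 18.2, 21.9]},
]
N = sum(s["N_h"] for s in strata)
xbar_N = 5.4  # assumed known population mean of x


def K(s):
    """K_h = W_h^2 (1 - f_h) / (n_h (n_h - 1)) for stratum s."""
    n_h = len(s["x"])
    W_h = s["N_h"] / N
    f_h = n_h / s["N_h"]
    return W_h ** 2 * (1.0 - f_h) / (n_h * (n_h - 1))


xbar_st = sum(s["N_h"] / N * mean(s["x"]) for s in strata)
ybar_st = sum(s["N_h"] / N * mean(s["y"]) for s in strata)
sxx = sum(K(s) * sum((xi - mean(s["x"])) ** 2 for xi in s["x"]) for s in strata)
sxy = sum(K(s) * sum((xi - mean(s["x"])) * (yi - mean(s["y"]))
                     for xi, yi in zip(s["x"], s["y"])) for s in strata)
beta_dopt = sxy / sxx

# Weights of the form W_h / n_h + (xbar_N - xbar_st) * K_h (x - xbar_h) / sxx.
weights = [
    [s["N_h"] / N / len(s["x"])
     + (xbar_N - xbar_st) / sxx * K(s) * (xi - mean(s["x"]))
     for xi in s["x"]]
    for s in strata
]

ybar_dopt = ybar_st + (xbar_N - xbar_st) * beta_dopt
ybar_from_weights = sum(w * yi for ws, s in zip(weights, strata)
                        for w, yi in zip(ws, s["y"]))
x_from_weights = sum(w * xi for ws, s in zip(weights, strata)
                     for w, xi in zip(ws, s["x"]))
```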

The estimator of (4.19) with $\Phi = \mathbf{D}_{\pi}$ is a function of
Horvitz-Thompson estimators of population moments. The
estimator (4.17) with $\Phi^{-1} = \mathrm{diag}\{K_h\}$, the diagonal matrix
with $K_h$ on the diagonal for elements in stratum $h$, and
dummy variables for stratum effects, gives the estimator of
the mean in the class

$$\bar{y}_{reg} = \bar{y}_{\pi} + (\bar{\mathbf{x}}_{1,N} - \bar{\mathbf{x}}_{1,\pi}) \hat{\beta}_1$$

with the smallest estimated design variance. If the true
slopes in the strata are the same and if the selection
probabilities are proportional to the square roots of the within-
stratum variances, then the use of $\Phi = \mathbf{D}_{\pi}$ gives a smaller
small sample MSE than the use of $\Phi^{-1} = \mathrm{diag}\{K_h\}$
because the sum of $w_{hi}^2 \sigma_h^2$ is smaller. Fuller and Isaki
(1981) noted that the design-optimum estimator is often
well approximated by the estimator constructed with
$\Phi = \mathbf{D}_{\pi}$.

We have introduced regression estimation for the mean,
but it is often the totals that are estimated and totals that are
used as controls. Consider the regression estimator of the
total of $y$ defined by

$$\hat{T}_{y,reg} = \hat{T}_y + (\mathbf{T}_{x,N} - \hat{\mathbf{T}}_x) \hat{\beta}, \tag{4.33}$$

where $\mathbf{T}_{x,N}$ is the population total of $\mathbf{x}$ and $(\hat{T}_y, \hat{\mathbf{T}}_x)$ is a
vector of design consistent estimators of $(T_{y,N}, \mathbf{T}_{x,N})$. By
analogy to (4.28), the estimator of the optimum $\beta$ is

$$\hat{\beta} = [\hat{V}\{\hat{\mathbf{T}}_x\}]^{-1} \hat{C}\{\hat{\mathbf{T}}_x, \hat{T}_y\}, \tag{4.34}$$

where $\hat{V}\{\hat{\mathbf{T}}_x\}$ is a design consistent estimator of the
variance of $\hat{\mathbf{T}}_x$, and $\hat{C}\{\hat{\mathbf{T}}_x, \hat{T}_y\}$ is a design consistent
estimator of the covariance of $\hat{\mathbf{T}}_x$ and $\hat{T}_y$.

The estimator of the total is $N \bar{y}_{reg}$ for simple random
sampling, but the exact equivalence may not hold in more
complicated samples, because in such situations the
estimated mean may be a ratio estimator. However, if the
regression estimator of the two totals is constructed using
(4.34), the ratio of the two estimated totals has large sample
variance equal to that of the regression estimator of the
mean. To see this write the error in the regression estimated
totals of $y$ and $u$ as
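$$\hat{T}_{y,reg} - T_{y,N} = \hat{T}_y - T_{y,N} + (\mathbf{T}_{x,N} - \hat{\mathbf{T}}_x) \beta_{yx,N} + O_p(N n^{-1})$$

and

$$\hat{T}_{u,reg} - T_{u,N} = \hat{T}_u - T_{u,N} + (\mathbf{T}_{x,N} - \hat{\mathbf{T}}_x) \beta_{ux,N} + O_p(N n^{-1}), \tag{4.35}$$

where we are assuming $\hat{T}_y - T_{y,N}$ and $\hat{\beta}_{yx} - \beta_{yx,N}$, and the
corresponding quantities for $u$, to be $O_p(N n^{-1/2})$ and
$O_p(n^{-1/2})$, respectively. Then the error in $\hat{T}_{u,reg}^{-1} \hat{T}_{y,reg}$ is

$$\hat{T}_{u,reg}^{-1} \hat{T}_{y,reg} - R_N = T_{u,N}^{-1} \big[ (\hat{T}_y - T_{y,N}) - R_N (\hat{T}_u - T_{u,N}) + (\mathbf{T}_{x,N} - \hat{\mathbf{T}}_x)(\beta_{yx,N} - R_N \beta_{ux,N}) + O_p(N n^{-1}) \big], \tag{4.36}$$

where $R_N = T_{u,N}^{-1} T_{y,N}$. If we construct the regression
estimator for $R_N$ starting with $\hat{R} = \hat{T}_u^{-1} \hat{T}_y$, we have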


Wail VnTe 


(4.37) 


where 


and 


“Ry Fi }) 


It follows that the large-sample design-optimum coefficient
for the ratio is $T_{u,N}^{-1} (\beta_{yx,N} - R_N \beta_{ux,N})$ and the ratio of
design-optimum regression estimators is the large sample
design-optimum regression estimator of the ratio.


5. MODELS AND REGRESSION ESTIMATION


In this section we assume that the analyst postulates a
detailed superpopulation model. Assume also that the
sample is an unequal probability sample or (and) the
specified error covariance structure is not a multiple of the
identity matrix. Then, only in special cases will the design
optimal estimator of (4.29) agree with the best estimator
constructed under the model, conditioning on the sample
x-values. To investigate this possible conflict, write the
model for the population in matrix notation as

$$\mathbf{y}_N = \mathbf{X}_N \beta + \mathbf{e}_N, \qquad \mathbf{e}_N \sim (\mathbf{0}, \Sigma_{NN}), \tag{5.1}$$

where $\mathbf{y}_N = (y_1, y_2, \ldots, y_N)'$, $\mathbf{e}_N = (e_1, e_2, \ldots, e_N)'$ and
$\mathbf{X}_N = (\mathbf{x}_1', \mathbf{x}_2', \ldots, \mathbf{x}_N')'$. It is assumed that $\Sigma_{NN}$ is known or
known up to a multiple. The model for a sample of $n$
observations is

$$\mathbf{y}_A = \mathbf{X}_A \beta + \mathbf{e}_A, \qquad \mathbf{e}_A \sim (\mathbf{0}, \Sigma_{AA}),$$

where $\mathbf{y}_A = (y_1, y_2, \ldots, y_n)'$, $\mathbf{e}_A = (e_1, e_2, \ldots, e_n)'$,
$\mathbf{X}_A = (\mathbf{x}_1', \mathbf{x}_2', \ldots, \mathbf{x}_n')'$ and we index the sample elements by
1, 2, ..., $n$, for convenience. We have used the subscript $N$
to identify population quantities, and the subscript $A$ to
identify sample quantities, but we will often omit the
subscript $A$ to simplify the notation. For example, we may
sometimes write the $n \times n$ covariance matrix as $\Sigma_{ee}$. The
unknown finite population mean is

$$\bar{y}_N = \bar{\mathbf{x}}_N \beta + \bar{e}_N. \tag{5.2}$$
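Under model (5.1), the best linear, conditionally
unbiased predictor of $\theta_N = \bar{y}_N$, conditional on $\mathbf{X}$, is

$$\hat{\theta} = N^{-1} \Big[ \sum_{i \in A} y_i + (N - n) \bar{\mathbf{x}}_{N-n} \hat{\beta} + \mathbf{J}_{N-n}' \Sigma_{\bar{A}A} \Sigma_{AA}^{-1} (\mathbf{y}_A - \mathbf{X}_A \hat{\beta}) \Big], \tag{5.3}$$

where $\bar{\mathbf{x}}_{N-n} = (N - n)^{-1} (N \bar{\mathbf{x}}_N - n \bar{\mathbf{x}}_n)$,
$\Sigma_{\bar{A}A} = E\{\mathbf{e}_{\bar{A}} \mathbf{e}_A'\}$,

$$\hat{\beta} = (\mathbf{X}_A' \Sigma_{AA}^{-1} \mathbf{X}_A)^{-1} \mathbf{X}_A' \Sigma_{AA}^{-1} \mathbf{y}_A,$$

$\mathbf{e}_{\bar{A}} = (e_{n+1}, e_{n+2}, \ldots, e_N)'$, $\mathbf{J}_{N-n}$ is an $N - n$ dimensional
column vector of ones, $\bar{\mathbf{x}}_n$ is the simple sample mean, and
$\bar{A}$ is the set of elements in $U$ that are not in $A$. See Royall
(1976). Under the model,

$$\hat{\theta} - \bar{y}_N = \mathbf{c}_A (\hat{\beta} - \beta) + N^{-1} \mathbf{J}_{N-n}' (\Sigma_{\bar{A}A} \Sigma_{AA}^{-1} \mathbf{e}_A - \mathbf{e}_{\bar{A}})$$

and

$$V\{\hat{\theta} - \bar{y}_N \mid \mathbf{X}_A\} = \mathbf{c}_A V\{\hat{\beta}\} \mathbf{c}_A' + N^{-2} \mathbf{J}_{N-n}' (\Sigma_{\bar{A}\bar{A}} - \Sigma_{\bar{A}A} \Sigma_{AA}^{-1} \Sigma_{A\bar{A}}) \mathbf{J}_{N-n}, \tag{5.4}$$

where

$$\mathbf{c}_A = N^{-1} [(N - n) \bar{\mathbf{x}}_{N-n} - \mathbf{J}_{N-n}' \Sigma_{\bar{A}A} \Sigma_{AA}^{-1} \mathbf{X}_A].$$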


Design consistency of estimator (5.3) and the situations
in which the model estimator reduces to the Horvitz-
Thompson estimator have been considered by, among
others, Isaki (1970), Royall (1970, 1976), Scott and Smith
(1974), Cassel, Särndal, and Wretman (1976, 1979, 1983),
Zyskind (1976), Tallis (1978), Isaki and Fuller (1982),
Wright (1983), Pfeffermann (1984), Tam (1986), Brewer,
Hanif and Tam (1988), Montanari (1999), and Gerow and
McCulloch (2000).

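The estimator (5.3) reduces to $\bar{\mathbf{x}}_N \hat{\beta}$ if there is an $\eta$ such
that

$$\mathbf{X}_A \eta = \Sigma_{AA} \mathbf{J}_n + \Sigma_{A\bar{A}} \mathbf{J}_{N-n} \tag{5.5}$$

for all samples with positive probability. If there is also a $\gamma$
such that

$$\mathbf{X}_A \gamma = \Sigma_{AA} \mathbf{D}_{\pi}^{-1} \mathbf{J}_n, \tag{5.6}$$

for all samples with positive probability, then $\hat{\theta}$ of (5.3) is
design consistent, where $\mathbf{D}_{\pi}$ was defined for (4.14). Given
a $\kappa$ such that

$$\mathbf{X}_A \kappa = \Sigma_{AA} (\mathbf{D}_{\pi}^{-1} \mathbf{J}_n - \mathbf{J}_n) - \Sigma_{A\bar{A}} \mathbf{J}_{N-n}, \tag{5.7}$$

then $\hat{\theta}$ of (5.3) is expressible as

$$\hat{\theta} = \bar{y}_{\pi} + (\bar{\mathbf{x}}_N - \bar{\mathbf{x}}_{\pi}) \hat{\beta}, \tag{5.8}$$

and if the design is such that $\bar{y}_{\pi}$ is design consistent for
$\bar{y}_N$, $\hat{\theta}$ of (5.8) is design consistent for $\bar{y}_N$.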
We call a regression model of the form (5.1) for which
(5.5) and (5.6), or (5.7), holds a full model. If (5.6) or (5.7)
does not hold, we call the model a reduced model or a
restricted model. We cannot expect the conditions for a full
model to hold for every analysis variable in a general
purpose survey because $\Sigma_{ee}$ will be different for different
$y$'s. Therefore, given a reduced model, one might search for
a good model estimator in the class of design consistent
estimators.

To construct a design consistent estimator of the form
$\bar{\mathbf{x}}_N \hat{\beta}$ when model (5.1) is a reduced model, we can add a
vector satisfying (5.7) to the X-matrix to create a full model.
There are two possible situations associated with this 
approach. In the first, the population mean (or total) of the 
added variable is known. With known mean, one can 
construct the usual regression estimator and the usual 
design variance estimation formulas are appropriate. 

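To describe an estimation procedure for the situation in
which the population mean of the added variable is not
known, let $\mathbf{q} = (q_1, q_2, \ldots, q_n)'$ denote the added vector,
where $\mathbf{q}$ is the vector on the right side of the equality in
(5.7). Let $\mathbf{H} = (\mathbf{X}, \mathbf{q})$, where $\mathbf{X}$ is the matrix of auxiliary
variables with known population mean vector, $\bar{\mathbf{x}}_N$. We
write the full model for the sample as

$$\mathbf{y} = \mathbf{H} \beta_{y \cdot h} + \mathbf{e}, \tag{5.9}$$

where $\mathbf{e} \sim (\mathbf{0}, \Sigma_{ee})$. The best linear conditionally unbiased
estimator of $\beta_{y \cdot h}$ is

$$\hat{\beta}_{y \cdot h} = (\mathbf{H}' \Sigma_{ee}^{-1} \mathbf{H})^{-1} \mathbf{H}' \Sigma_{ee}^{-1} \mathbf{y}. \tag{5.10}$$

If the coefficient for $q$ in (5.9) is not zero, it is not
possible to construct a conditionally unbiased estimator of
$\bar{\mathbf{h}}_N \beta_{y \cdot h}$ because the $\bar{q}_N$ component of $\bar{\mathbf{h}}_N$ is unknown.
However, because $\hat{\beta}_{y \cdot h}$ is unbiased for $\beta_{y \cdot h}$, it is possible to
construct a conditionally unbiased estimator of any linear
function of $\beta_{y \cdot h}$. Thus, it is natural to replace the unknown $\bar{q}_N$
with the "best available" estimator of $\bar{q}_N$, and a reasonable
choice is the regression estimator,

$$\bar{q}_{reg} = \bar{q}_{\pi} + (\bar{\mathbf{x}}_N - \bar{\mathbf{x}}_{\pi}) \hat{\beta}_{q \cdot x}, \tag{5.11}$$

where $\hat{\beta}_{q \cdot x} = (\mathbf{X}' \Sigma_{ee}^{-1} \mathbf{X})^{-1} \mathbf{X}' \Sigma_{ee}^{-1} \mathbf{q}$. Then the estimator (5.3)
becomes

$$\hat{\theta} = \bar{y}_{\pi} + [(\bar{\mathbf{x}}_N, \bar{q}_{reg}) - (\bar{\mathbf{x}}_{\pi}, \bar{q}_{\pi})] \hat{\beta}_{y \cdot h}. \tag{5.12}$$

The estimator (5.12) can be expressed in the familiar
regression estimator form,

$$\bar{y}_{reg} = \bar{y}_{\pi} + (\bar{\mathbf{x}}_N - \bar{\mathbf{x}}_{\pi}) \hat{\beta}_{y \cdot x}. \tag{5.13}$$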
That is, the regression estimator of the finite population
mean of $y$ based on the full model, but with the mean of $q_i$
unknown and estimated with the regression estimator, is the
regression estimator with $\beta_{y \cdot x}$ estimated by the generalized
least squares regression of $y$ on $\mathbf{x}$ using the covariance
matrix $\Sigma_{ee}$. See Park (2002). The estimator is conditionally
model unbiased under the reduced model containing only $\mathbf{x}$
if the reduced model is true. If the population coefficient for
$q_i$ is not zero, the reduced model is not true. Then the
estimator is conditionally model biased, but the estimator is
unbiased for the finite population mean under the full model
and an unbiased design, because

$$E\{\bar{y}_{reg} - \bar{y}_N\} = E\{E[\bar{y}_{reg} - \bar{y}_N \mid \mathbf{H}]\} = E\{(\mathbf{0}, \bar{q}_{reg} - \bar{q}_N) \beta_{y \cdot h}\} \doteq 0, \tag{5.14}$$

where $\bar{y}_{reg}$ is the estimator of (5.12) and the approximation is due
to the approximate design expectation of the regression
estimator $\bar{q}_{reg}$.

The estimator (5.13) is a linear estimator, where the
vector of weights, $\mathbf{w}$, minimizes the Lagrangean

$$\mathbf{w}' \Sigma_{ee} \mathbf{w} + (\mathbf{w}' \mathbf{X} - \bar{\mathbf{x}}_N) \lambda. \tag{5.15}$$

The estimator is location invariant if the column of ones is
in the column space of $\mathbf{X}$.

Because the variable $q$ is the variable whose omission
from the full model can produce a bias, it seems prudent to
test the coefficient of $q$ before using the reduced model to
construct an estimator for the mean of $y$. This can be done
using a model estimator of the variance,

$$V\{\hat{\beta}_{y \cdot h} \mid \mathbf{H}\} = (\mathbf{H}' \Sigma_{ee}^{-1} \mathbf{H})^{-1},$$

or using the design estimator of variance of (4.12). See
DuMouchel and Duncan (1983) and Fuller (1984).

A working specification for $\Sigma_{ee}$ may be particularly
appropriate for two-stage samples, see Royall (1976, 1986)
and Montanari (1987). A reasonable model is that in which
there is common correlation among items in the same
primary sampling unit and zero correlation between units in
different primary sampling units. Because the associated
$\Sigma_{ee}$ is block diagonal of a particular form, it is relatively
easy to invert and hence the estimator based on such a
working $\Phi$ is relatively easy to construct. The regression
estimator using a $\Phi$ with a nonzero correlation for units in
the same primary sampling unit is a combination of the
estimator based on primary sampling unit totals and that
based on elements. See Fuller and Battese (1973). Thus, the
use of such a $\Phi$ can avoid variance problems associated
with the use of primary sampling unit totals.


6. MAXIMUM LIKELIHOOD AND RAKING 
RATIO 


The theoretical foundation for the regression estimators
discussed in section 3 and section 4 is maximum likelihood
estimation for the linear model with normal errors. We now
consider the likelihood for multinomial variables. Given a
simple random sample from a multinomial defined by the
entries in a two way table, the logarithm of the likelihood,
except for a constant, is

$$\sum_{i=1}^{r} \sum_{j=1}^{c} a_{ij} \log p_{ij}, \tag{6.1}$$

where $a_{ij}$ is the estimated fraction in cell $ij$, $p_{ij}$ is the
population fraction in cell $ij$, $r$ is the number of rows, and
$c$ is the number of columns. If (6.1) is maximized subject to
the restriction $\sum \sum p_{ij} = 1$, one obtains the maximum
likelihood estimators $\hat{p}_{ij} = a_{ij}$. If the marginal row fractions
$p_{i \cdot, N}$ and the marginal column fractions $p_{\cdot j, N}$ are known, it
is natural to maximize the likelihood subject to these
constraints by using the Lagrangean

$$\sum_{i=1}^{r} \sum_{j=1}^{c} a_{ij} \log p_{ij} + \sum_{i=1}^{r} \lambda_i \Big( \sum_{j=1}^{c} p_{ij} - p_{i \cdot, N} \Big) + \sum_{j=1}^{c} \lambda_{r+j} \Big( \sum_{i=1}^{r} p_{ij} - p_{\cdot j, N} \Big), \tag{6.2}$$

where $\lambda_i$, $i = 1, 2, \ldots, r$, are for the row restrictions and
$\lambda_{r+j}$, $j = 1, 2, \ldots, c$, are for the column restrictions. There is no
explicit expression for the solution to (6.2) and there may
be no solution if there are too many empty cells. A
procedure that produces estimates close to the maximum
likelihood solution is that called raking ratio or iterative
proportional fitting. The procedure iterates, first making
ratio adjustments for the row restrictions, then making ratio
adjustments for the column restrictions, then making ratio
adjustments for the row restrictions, etc. The method is
generally credited to Deming and Stephan (1940). See, for
example, Bishop, Fienberg and Holland (1975, Chapter 3).
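The alternating row and column adjustments can be sketched in a few lines. This is a minimal illustration of raking ratio (iterative proportional fitting) on a hypothetical two way table, assuming every cell is positive and the row and column targets have the same total; the function name `rake`, the fixed iteration count, and the numbers are choices made for the example.

```python
def rake(table, row_targets, col_targets, iterations=100):
    """Iterative proportional fitting of a two way table to known margins."""
    t = [row[:] for row in table]
    for _ in range(iterations):
        # Ratio adjustment so that each row sum matches its target.
        for i, row in enumerate(t):
            s = sum(row)
            t[i] = [v * row_targets[i] / s for v in row]
        # Ratio adjustment so that each column sum matches its target.
        for j in range(len(col_targets)):
            s = sum(row[j] for row in t)
            for row in t:
                row[j] *= col_targets[j] / s
    return t


a = [[0.20, 0.30],
     [0.10, 0.40]]               # hypothetical estimated cell fractions
row_targets = [0.60, 0.40]       # assumed known marginal row fractions
col_targets = [0.35, 0.65]       # assumed known marginal column fractions

fitted = rake(a, row_targets, col_targets)
row_sums = [sum(row) for row in fitted]
col_sums = [sum(row[j] for row in fitted) for j in range(2)]
odds_ratio_before = a[0][0] * a[1][1] / (a[0][1] * a[1][0])
odds_ratio_after = fitted[0][0] * fitted[1][1] / (fitted[0][1] * fitted[1][0])
```

Each adjustment multiplies a whole row or column by a constant, so the cross-product ratio of the table is unchanged; only the margins move toward the targets.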

Deville and Särndal (1992) considered a class of
objective functions of the form $\sum_{i \in A} G(w_i, a_i)$, where
$G(w, a)$ is a measure of distance between an initial weight
$a_i$ and a final weight $w_i$. The objective function is
minimized subject to the constraints

$$\sum_{i \in A} w_i \mathbf{x}_i = \bar{\mathbf{x}}_N. \tag{6.3}$$

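Deville and Särndal (1992) used the term calibrated to
describe weights satisfying (6.3). If the initial weight is
$a_i = (\sum_{j \in A} \pi_j^{-1})^{-1} \pi_i^{-1}$ and if one is the first element of $\mathbf{x}_i$, the
solution to the minimization problem is approximated by a
regression estimator of the mean of the form

$$\bar{y}_{reg} = \bar{y}_{\pi} + (\bar{\mathbf{x}}_N - \bar{\mathbf{x}}_{\pi}) \hat{\beta}, \tag{6.4}$$

where

$$\hat{\beta} = \Big( \sum_{i \in A} \mathbf{x}_i' \phi_{ii}^{-1} \mathbf{x}_i \Big)^{-1} \sum_{i \in A} \mathbf{x}_i' \phi_{ii}^{-1} y_i,$$

and $\phi_{ii}$ is the second derivative of $G(w, a)$ with respect to
$w$ evaluated at $(w, a) = (a_i, a_i)$. Using this approach,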
Deville and Särndal (1992) showed that the maximum
likelihood and raking ratio estimators have the same
limiting distribution as the regression estimator (4.18) with
$\Phi = \mathbf{D}_{\pi}$. To obtain the raking ratio weights they used the
objective function

$$\sum_{i \in A} [w_i \log(a_i^{-1} w_i) + a_i - w_i], \tag{6.5}$$

and to obtain the maximum likelihood weights they used
the objective function

$$\sum_{i \in A} [w_i - a_i - a_i \log(a_i^{-1} w_i)]. \tag{6.6}$$

Deville, Särndal and Sautory (1993) investigated four
estimators in the class. Although weights constructed using
different functions could differ considerably, the authors
concluded that estimates were quite similar, a result
consistent with the theory. Singh and Mohl (1996) and
Théberge (1999, 2000) discuss estimators with the
calibration property.
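To make the calibration idea concrete, the sketch below computes weights that minimize the raking-type distance $\sum [w_i \log(w_i / a_i) + a_i - w_i]$ subject to $\sum w_i \mathbf{x}_i = \bar{\mathbf{x}}_N$. Setting the derivative of the Lagrangean to zero gives $w_i = a_i \exp(\mathbf{x}_i \lambda)$, and the multipliers are found here by Newton's method. This is an illustrative sketch only: the data, the two-element $\mathbf{x}_i = (1, x_i)$, the fixed number of Newton steps without step control, and the function name are assumptions for the example.

```python
import math


def raking_calibration(a, X, xbar_N, steps=50):
    """Calibrated weights w_i = a_i * exp(x_i . lam) with sum_i w_i x_i = xbar_N.

    X holds rows (1, x_i); lam is found by plain Newton iteration.
    """
    lam = [0.0, 0.0]
    for _ in range(steps):
        w = [ai * math.exp(r[0] * lam[0] + r[1] * lam[1]) for ai, r in zip(a, X)]
        # Calibration residual g(lam) and its Jacobian h(lam) = sum_i w_i x_i' x_i.
        g = [sum(wi * r[k] for wi, r in zip(w, X)) - xbar_N[k] for k in range(2)]
        h = [[sum(wi * r[j] * r[k] for wi, r in zip(w, X)) for k in range(2)]
             for j in range(2)]
        det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
        lam[0] -= (h[1][1] * g[0] - h[0][1] * g[1]) / det
        lam[1] -= (h[0][0] * g[1] - h[1][0] * g[0]) / det
    return [ai * math.exp(r[0] * lam[0] + r[1] * lam[1]) for ai, r in zip(a, X)]


pi = [0.1, 0.2, 0.2, 0.4]                      # assumed selection probabilities
total = sum(1.0 / p for p in pi)
a = [1.0 / (p * total) for p in pi]            # initial weights summing to one
X = [[1.0, 2.0], [1.0, 3.0], [1.0, 5.0], [1.0, 7.0]]
xbar_N = [1.0, 4.0]                            # assumed known population means

w = raking_calibration(a, X, xbar_N)
calibrated_means = [sum(wi * r[k] for wi, r in zip(w, X)) for k in range(2)]
```

Because each final weight is an initial weight multiplied by an exponential factor, the calibrated weights are positive by construction, which is one practical attraction of this distance function.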


7. POPULATION OF AUXILIARY VECTORS 
KNOWN AT ESTIMATION STEP 


If the x-vector is known for all of the population 
elements, the number of possible regression-type estimators 
is greatly expanded. Most procedures involve the fitting of 
an approximating function for the relationship between y 
and the auxiliary variables. The most used procedure is to 
assign the population elements to categories on the basis of 
the auxiliary data and to use these categories as post strata. 
This procedure is equivalent to approximating the expected 
value of y given x by a step function. The estimator is 
formally equivalent to the regression estimator (4.19) where 
the x-vector is a vector of indicator variables for post- 
stratum membership. 

The application of the procedure often requires the 
development of criteria to use in forming the post strata. 
Typically the post strata are formed so that each post 
stratum contains a minimum number of sample elements 
and so that the weights for any post stratum are not overly 
large. Estimation with post strata and the formation of post 
strata have been studied by Fuller (1966), Holt and Smith
(1979), Tremblay (1986), Kalton and Maligalig (1991),
Little (1993), Eltinge and Yansaneh (1997), and Lazzeroni 
and Little (1998), among others. Holt and Smith (1979) 
argued for the use of a conditional variance estimator for 
post stratification. 
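The equivalence between post stratification and regression on post-stratum indicators can be seen in a small example. With indicator variables and weights proportional to $\pi_i^{-1}$, the cross-product matrix is diagonal, each regression coefficient is the weighted cell mean, and the population mean of the indicator vector is the vector of known cell fractions. The sketch below is illustrative only; the data, the cell labels, and the function names are hypothetical.

```python
def poststratified_mean(y, cell, pi, cell_fractions):
    """Sum over cells of (known cell fraction) * (weighted sample mean in the cell)."""
    est = 0.0
    for g, W_g in cell_fractions.items():
        num = sum(yi / p for yi, c, p in zip(y, cell, pi) if c == g)
        den = sum(1.0 / p for c, p in zip(cell, pi) if c == g)
        est += W_g * num / den
    return est


def poststratified_weights(cell, pi, cell_fractions):
    """Final weights: inverse probabilities ratio adjusted to the cell fractions."""
    totals = {g: sum(1.0 / p for c, p in zip(cell, pi) if c == g)
              for g in cell_fractions}
    return [cell_fractions[c] / (p * totals[c]) for c, p in zip(cell, pi)]


y = [4.0, 6.0, 5.0, 9.0, 11.0]                 # hypothetical observations
cell = ["a", "a", "b", "b", "b"]               # post-stratum membership
pi = [0.1, 0.2, 0.2, 0.1, 0.4]                 # assumed selection probabilities
cell_fractions = {"a": 0.3, "b": 0.7}          # assumed known population fractions

estimate = poststratified_mean(y, cell, pi, cell_fractions)
w = poststratified_weights(cell, pi, cell_fractions)
estimate_from_weights = sum(wi * yi for wi, yi in zip(w, y))
weight_in_a = sum(wi for wi, c in zip(w, cell) if c == "a")
```

The weights within each post stratum sum to the known fraction for that stratum, which is the calibration property for the indicator variables.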

Given the population of x-vectors, one can use the
sample to estimate a functional relationship between $y$ and
$\mathbf{x}$ and then predict the unobserved $y$. If the procedure is to
be design consistent, then a condition similar to (4.14) must
hold. One way to ensure design consistency is to require the
fitted model to satisfy

$$\sum_{i \in A} \pi_i^{-1} [y_i - f(\mathbf{x}_i, \hat{\beta})] = 0, \tag{7.1}$$

where $f(\mathbf{x}_i, \hat{\beta})$ is the model estimated value for the $i$-th
observation.

Firth and Bennett (1998) pointed out that some nonlinear
models satisfy (7.1). If the initial model does not satisfy
(7.1), an estimated intercept term can be added to create an
expanded full model,

$$f_e(\mathbf{x}_i, \hat{\beta}) = \hat{\beta}_0 + f(\mathbf{x}_i, \hat{\beta}).$$

This is a direct extension of the ideas of difference
estimation to the nonlinear case. See Isaki (1970), Cassel,
Särndal and Wretman (1976) and Wright (1983). A closely
related approach was suggested by Wu and Sitter (2001) in
which the fitted function $f(\mathbf{x}_i, \hat{\beta})$ is used as the auxiliary
variable in a linear regression estimator.

A number of “local” procedures, other than step 
functions, can be used to approximate the functional 
relationship between x and y. Spline functions and 
polynomials are linear models that fall within the class of 
section 4. Estimators that use some kind of local smoothing 
to estimate population quantities have been considered for 
finite populations from a model viewpoint by Kuo (1988), 
Dorfman (1993), Dorfman and Hall (1993), Chambers 
(1996), and Chambers, Dorfman and Wehrly (1993). Breidt 
and Opsomer (2000) showed that estimators based on local 
polynomial regression are design consistent. Firth and 
Bennett (1998) also considered local fit models. 


8. REGRESSION ESTIMATION AND 
NONRESPONSE 


Regression estimation is frequently a part of procedures
used to adjust data for unit nonresponse. Regression can be
justified on the basis of a model such as (3.1) or on the basis
that regression can adjust for unequal response
probabilities. See Cassel, Särndal and Wretman (1979, 1983),
Little (1982, 1986), Bethlehem (1988), Kott (1994), Fuller,
Loughin and Baker (1994) and Fuller and An (1998).

Consider an estimator of the population regression vector
of the form (4.4) with $\Phi = \mathbf{D}_{\pi}$ constructed with the
responding units. Denote the estimator by $\hat{\beta}$ and let $p_i$ be
the conditional probability of observing unit $i$ given that the
unit is selected for the sample. Then under regularity
conditions, the estimator $\hat{\beta}$ is a consistent estimator of

$$\gamma_N = \Big( \sum_{i \in U} \mathbf{x}_i' p_i \mathbf{x}_i \Big)^{-1} \sum_{i \in U} \mathbf{x}_i' p_i y_i. \tag{8.1}$$

The population mean of $y$ can be expressed as

$$\bar{y}_N = \bar{\mathbf{x}}_N \gamma_N + \bar{a}_N, \tag{8.2}$$

where $a_i = y_i - \mathbf{x}_i \gamma_N$ and $\bar{a}_N$ is the population mean of the $a_i$.
The regression estimator $\bar{y}_{reg} = \bar{\mathbf{x}}_N \hat{\beta}$ will be consistent for $\bar{y}_N$
if the probability limit of $\bar{a}_N$ is zero. The probability limit
of $\bar{a}_N$ will be zero if the sequence of finite populations is a
sequence of random samples from an infinite population in
which

$$y_i = \mathbf{x}_i \beta + e_i \tag{8.3}$$

and the $e_i$ of the sample are independent of $\mathbf{x}_i$ with
$E\{e_i \mid \mathbf{x}_i\} = 0$.

Alternatively, a sufficient condition for $\bar{a}_N$ to be zero is
the existence of a column vector $\xi$ such that

$$\mathbf{x}_i \xi = p_i^{-1} \tag{8.4}$$

for $i = 1, 2, \ldots, N$. Thus, if the reciprocal of the response
probability is a linear function of the control variables, the
regression estimator is a consistent estimator of the mean of
$y$. One way in which (8.4) can be satisfied is for the
elements of $\mathbf{x}_i$ to be dummy variables that define subgroups
and for the response probabilities to be constant in each
subgroup.
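The subgroup case can be checked directly on a small artificial population. In the sketch below, which is an illustration with hypothetical data and function names, the control variables are subgroup indicators, so the limit of the respondent regression coefficient for each subgroup is the response-probability-weighted mean of $y$ in that subgroup. When the response probability is constant within subgroups the population mean of the residuals $a_i$ is zero; when it varies with $y$ inside a subgroup it is not.

```python
def limit_coefficients(population):
    """Per-subgroup limit of the respondent regression with indicator variables:
    sum(p * y) / sum(p) within each subgroup."""
    groups = sorted({g for g, _, _ in population})
    return {g: sum(p * y for gg, y, p in population if gg == g)
               / sum(p for gg, _, p in population if gg == g)
            for g in groups}


def mean_residual(population):
    """Population mean of a_i = y_i minus the limit coefficient of its subgroup."""
    gamma = limit_coefficients(population)
    return sum(y - gamma[g] for g, y, _ in population) / len(population)


# Each tuple is (subgroup, y value, response probability); values are hypothetical.
constant_within = [("a", 1.0, 0.5), ("a", 2.0, 0.5), ("a", 3.0, 0.5), ("a", 6.0, 0.5),
                   ("b", 10.0, 0.8), ("b", 14.0, 0.8)]
varying_within = [("a", 1.0, 0.2), ("a", 2.0, 0.4), ("a", 3.0, 0.6), ("a", 6.0, 0.8),
                  ("b", 10.0, 0.8), ("b", 14.0, 0.8)]

a_bar_constant = mean_residual(constant_within)
a_bar_varying = mean_residual(varying_within)
```

In the second population, units with larger $y$ respond more often within subgroup "a", so the respondent-based coefficient overstates the subgroup mean and `a_bar_varying` is negative.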

If (8.4) holds and if the probability of responding is
independent from unit to unit, then the estimated variance
based on (4.12) is an appropriate estimator for the variance
of the regression estimator of the mean. It is particularly
important that a variance estimator of the form (4.12) or
(4.25), and not of the form (4.26), be used, because $\bar{\mathbf{x}}_{\pi} - \bar{\mathbf{x}}_N$
is, in general, not $O_p(n^{-1/2})$ in the presence of nonresponse.
Singh and Folsom (2000) make a similar argument for the
variance estimator (4.25) when using regression to adjust
for coverage error.

Often a preliminary adjustment to the selection
probabilities is made for nonresponse and this is followed by
regression estimation. The most frequently used response
adjustment is to form adjustment cells (post strata) and to
ratio adjust the weights of respondents in the cell so that the
sum of the weights is equal to the estimated (or known)
total for the cell. See, for example, Little and Rubin (1987,
page 250). Procedures using an estimated response
probability function are discussed by Cassel, Särndal and
Wretman (1983), Rosenbaum and Rubin (1983), Folsom
and Witt (1994), Fuller and An (1998), and Folsom and
Singh (2000). Brick, Waksberg and Keeter (1996) use an
estimated contact probability to adjust for frame coverage.

To consider procedures based on estimated response
probabilities, assume that the inverse of the response
probability for individual $i$ is given by

$$p_i^{-1} = g(\mathbf{z}_i; \theta^{\circ}), \tag{8.5}$$

where $\mathbf{z}_i$ is a vector of variables that can be observed for
both respondents and nonrespondents, $\theta^{\circ}$ is the true value of $\theta$,
and $g(\mathbf{z}_i; \theta)$ is continuous in $\theta$ with continuous first and
second derivatives in an open set containing $\theta^{\circ}$ for all $\mathbf{z}_i$.
The vector $(y_i, \mathbf{x}_i, \mathbf{z}_i)$ is observed, and we assume that $p_i$ is
bounded below by a positive number.

Let $\delta_i$ be the indicator variable with $\delta_i = 1$ if a response is
obtained and $\delta_i = 0$ if a response is not obtained. Using the
vector $(\delta_i, \mathbf{z}_i)$, the parameter $\theta^{\circ}$ of the response probability
function is estimated. Assume that $\hat{\theta} - \theta^{\circ} = O_p(n^{-1/2})$, where
$\hat{\theta}$ is the estimator of $\theta^{\circ}$. Let $\beta_N$ denote the finite population
regression vector for the regression of $y$ on $\mathbf{x}$. Let


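$$\hat{\beta} = \Big( \sum_{i \in A} \mathbf{x}_i' \pi_i^{-1} \hat{p}_i^{-1} \delta_i \mathbf{x}_i \Big)^{-1} \sum_{i \in A} \mathbf{x}_i' \pi_i^{-1} \hat{p}_i^{-1} \delta_i y_i, \tag{8.6}$$

where $\pi_i$ are the selection probabilities and $\hat{p}_i^{-1} = g(\mathbf{z}_i; \hat{\theta})$.
Under conditions of the type used in section 4,

$$\hat{\beta} - \beta_N = \mathbf{M}_n^{-1} \sum_{i \in A} \delta_i \pi_i^{-1} p_i^{-1} \mathbf{x}_i' a_i [1 + p_i \mathbf{g}_{\theta,i} (\hat{\theta} - \theta^{\circ})] + O_p(n^{-1}),$$

where $\mathbf{g}_{\theta,i}$ is the row vector of first derivatives of $g(\mathbf{z}_i; \theta)$
evaluated at $\theta = \theta^{\circ}$ and $\mathbf{M}_n = \sum_{i \in A} \mathbf{x}_i' \mathbf{x}_i \pi_i^{-1} p_i^{-1} \delta_i$. If $\mathbf{g}_{\theta,i}$
is uncorrelated with $a_i$, then the term involving $\mathbf{g}_{\theta,i}$ is
$O_p(n^{-1})$ and the variance estimator constructed as if
$g(\mathbf{z}_i; \theta^{\circ})$ is known is appropriate. The conditions are
satisfied if $\mathbf{z}_i$ is a subvector of $\mathbf{x}_i$ and $\mathbf{z}_i$ defines imputation
cells (adjustment cells) with equal response rates within a
cell.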


9. PRACTICAL CONSIDERATIONS 


If the regression weights are to be used in a general 
purpose survey, no individual weight used in estimating a 
total should be less than one. Also, it seems reasonable, on 
robustness grounds, to avoid very large weights. We discuss 
some procedures that have been developed to accomplish 
these objectives. 

A number of algorithms produce positive weights with
a high probability. Raking ratio procedures produce
positive weights for most data configurations. Deville,
Särndal and Sautory (1993) discuss the extension of raking
ratio to general x-variables and extensions to include
bounds on the weights.

Tillé (1998) suggested the use of approximate
conditional probabilities, conditional on $\bar{\mathbf{x}}_{\pi}$, to compute an
estimator. His approximation can be extended to produce
regression weights that are positive with high probability.
Let $\bar{\mathbf{x}}_{\pi}^{(i)}$ be an estimator obtained by deleting element $i$, or
primary sampling unit $i$, and modifying the remaining
weights so that $\bar{\mathbf{x}}_{\pi}^{(i)}$ is unbiased, or consistent to the same
order as $\bar{\mathbf{x}}_{\pi}$, for the population mean of all elements
excluding $i$. The estimator $\bar{\mathbf{x}}_{\pi}^{(i)}$ can be the estimator used to
construct jackknife deviates. Let $\hat{\Sigma}_{xx}$ be an estimator of the
covariance matrix of $\bar{\mathbf{x}}_{\pi}$ and let $\hat{\Sigma}_{xx(i)}$ be an estimator of
the conditional covariance matrix of $\bar{\mathbf{x}}_{\pi}$ conditional on
$i \in A$. Then, in large samples $\bar{\mathbf{x}}_{\pi}$ and $\bar{\mathbf{x}}_{\pi}^{(i)}$ are approximately
normally distributed and an estimator of the probability that
$i$ is in the sample given the estimated mean $\bar{\mathbf{x}}_{\pi}$ is

$$\pi_{i|A} = P\{i \in A \mid \bar{\mathbf{x}}_{\pi}, \bar{\mathbf{x}}_N\} = \pi_i |\hat{\Sigma}_{xx}|^{1/2} |\hat{\Sigma}_{xx(i)}|^{-1/2} \exp\{0.5 (G_{xx} - G_{xx(i)})\}, \tag{9.1}$$

where

$$G_{xx} = (\bar{\mathbf{x}}_{\pi} - \bar{\mathbf{x}}_N) \hat{\Sigma}_{xx}^{-1} (\bar{\mathbf{x}}_{\pi} - \bar{\mathbf{x}}_N)', \qquad G_{xx(i)} = (\bar{\mathbf{x}}_{\pi}^{(i)} - \bar{\mathbf{x}}_N^{(i)}) \hat{\Sigma}_{xx(i)}^{-1} (\bar{\mathbf{x}}_{\pi}^{(i)} - \bar{\mathbf{x}}_N^{(i)})',$$

and $\bar{\mathbf{x}}_N^{(i)} = (N - 1)^{-1} (N \bar{\mathbf{x}}_N - \mathbf{x}_i)$. For simple random
sampling, Tillé (1998) showed that the estimator

$$\bar{y}_{\pi|A} = N^{-1} \sum_{i \in A} \pi_{i|A}^{-1} y_i, \tag{9.2}$$

where $\pi_{i|A}$ is the conditional probability calculated under
the normality assumptions, is approximately equal to the
regression estimator. Because the estimator is not
calibrated, we suggest a calibrated version obtained by
computing the regression estimator with $\pi_{i|A}^{-1}$ as initial
weights. The difference between (9.2) and the regression
estimator constructed with initial weights $\pi_{i|A}^{-1}$ is $O_p(n^{-1})$.
Hence, there is a good chance that the regression weights so
constructed will be positive. The variance estimator $\hat{\Sigma}_{xx(i)}$
is relatively simple to compute for stratified samples but
may require considerable computation for other cases. Thus
one may choose to approximate $\hat{\Sigma}_{xx(i)}$.

Given that the regression weights are being constructed 
by minimizing an objective function, one can add 
restrictions to the problem to place bounds on the weights. 
Huang and Fuller (1978) gave an iterative procedure 
equivalent to constructing a $\Phi$ matrix at each step that 
reduces the weight on observations whose current weight 
deviates from the average by a large absolute amount. 

To discuss additional procedures associated with quadratic objective functions, assume we have a working covariance matrix, denoted by $\Phi_{ee}$, for the model (5.1) that is to be used to construct a regression estimator. Let $\alpha$ be the column vector of initial weights and assume $\Phi_{ee}\alpha$ is in the column space of $X$. Then the weights that minimize the conditional model variance are the weights that minimize $w'\Phi_{ee}w$ or, equivalently, that minimize 

$$(w - \alpha)'\,\Phi_{ee}\,(w - \alpha) \tag{9.3}$$

subject to the constraint 

$$w'X = \bar{x}_N. \tag{9.4}$$
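
The equivalence of the two objective functions can be checked directly. The sketch below writes $\alpha$ for the vector of initial weights and $\Phi_{ee}$ for the working covariance matrix, and introduces a vector $c$ (a symbol used here only for illustration) such that $\Phi_{ee}\alpha = Xc$, which exists because $\Phi_{ee}\alpha$ is assumed to lie in the column space of $X$. For every $w$ satisfying the calibration constraint (9.4):

```latex
\begin{align*}
(w-\alpha)'\Phi_{ee}(w-\alpha)
  &= w'\Phi_{ee}w - 2\,w'\Phi_{ee}\alpha + \alpha'\Phi_{ee}\alpha \\
  &= w'\Phi_{ee}w - 2\,w'Xc + \alpha'\Phi_{ee}\alpha \\
  &= w'\Phi_{ee}w - 2\,\bar{x}_N c + \alpha'\Phi_{ee}\alpha .
\end{align*}
```

The last two terms do not depend on $w$, so on the constraint set the two objective functions differ by a constant and have the same minimizer.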


Given an objective function, we can add restrictions on the $w_i$ 
such as 

$$L_1 \le w_i \le L_2, \tag{9.5}$$

where $L_1$ and $L_2$ are nonnegative constants. Minimizing 
(9.3), subject to the constraints (9.4) and (9.5) is a quadratic 
programming problem. The use of quadratic programming 
was suggested by Husain (1969) and was used by Isaki, 
Tsay and Fuller (2000). 
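
As an illustration of the quadratic objective, the following minimal sketch computes the weights that minimize (9.3) subject to (9.4) when no bounds are imposed, using the Lagrangian solution $w = \alpha + \Phi_{ee}^{-1}X(X'\Phi_{ee}^{-1}X)^{-1}(\bar{x}_N - \alpha'X)'$. It assumes a diagonal working covariance matrix. The function names (`solve`, `calibrate`, `within_bounds`) and the five-element data set are hypothetical and not taken from the paper. The second call shows a configuration in which the unrestricted solution violates bounds of the form (9.5); enforcing the bounds would require a quadratic programming routine, which is not shown.

```python
def solve(a, b):
    """Solve the small square linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # partial pivoting for numerical stability
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def calibrate(X, alpha, phi, xbar_N):
    """Weights minimizing sum_i phi[i]*(w_i - alpha_i)**2 subject to w'X = xbar_N.

    phi holds the diagonal of the working covariance matrix (an assumption
    made to keep the sketch short).
    """
    n, k = len(X), len(xbar_N)
    # X' Phi^{-1} X
    xpx = [[sum(X[i][a] * X[i][b] / phi[i] for i in range(n)) for b in range(k)]
           for a in range(k)]
    # calibration gap: xbar_N - alpha'X
    gap = [xbar_N[a] - sum(alpha[i] * X[i][a] for i in range(n)) for a in range(k)]
    lam = solve(xpx, gap)
    return [alpha[i] + sum(X[i][a] * lam[a] for a in range(k)) / phi[i]
            for i in range(n)]


def within_bounds(w, lower, upper):
    """Check the bound restriction lower <= w_i <= upper for every weight."""
    return all(lower <= wi <= upper for wi in w)


# Hypothetical sample of five elements: an intercept and one auxiliary variable.
X = [[1.0, x] for x in (1.0, 2.0, 3.0, 4.0, 5.0)]
alpha = [0.2] * 5          # equal initial weights for estimating a mean
phi = [1.0] * 5            # identity working covariance matrix

w_mild = calibrate(X, alpha, phi, [1.0, 3.5])   # control mean close to sample mean
w_far = calibrate(X, alpha, phi, [1.0, 5.5])    # control mean far from sample mean
print(w_mild, within_bounds(w_mild, 0.0, 1.0))
print(w_far, within_bounds(w_far, 0.0, 1.0))
```

In this toy configuration the second set of weights satisfies the calibration constraint but contains negative values, which is the situation the bounds in (9.5) are meant to exclude.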

If a large number of control variables are used, it may not 
be possible to construct weights satisfying the calibration 
constraints and also falling within reasonable bounds. The 
practitioner is faced with making compromises. The most 
common practice is to drop variables from the model. See 
Bankier, Rathwell and Majkowski (1992) and Silva and 
Skinner (1997). To discuss an alternative procedure, 
consider the situation in which some of the constraints are 
required but others can be relaxed. Let the matrix of 
observations on the auxiliary variables be partitioned as 
$(X_1, X_2)$, where $X_1$ is the set of variables for which exact 
constraints are required and $X_2$ is the set for which the 
constraints can be relaxed. Assume $\Phi_{ee}\alpha$ is in the column 
space of $X_1$. Then a generalization of (9.3) and (9.4) is the 
function 


$$(w - \alpha)'\,\Phi_{ee}\,(w - \alpha) + (w'X_2 - \bar{x}_{2,N})\,\Psi\,(w'X_2 - \bar{x}_{2,N})' \tag{9.6}$$

and the constraint 

$$w'X_1 = \bar{x}_{1,N}, \tag{9.7}$$

where $\Phi_{ee}$ and $\Psi$ are positive definite symmetric matrices and $\bar{x}_N = (\bar{x}_{1,N}, \bar{x}_{2,N})$. The $w$ that minimizes (9.6) subject to (9.7) minimizes the mean squared error of the unbiased linear predictor of $\bar{x}_N\beta$ under the mixed model 

$$y = X_1\beta_1 + X_2\beta_2 + e,$$

where $\beta_2 \sim (0, \Psi)$, $e \sim (0, \Phi_{ee})$, the random vector $\beta_2$ is independent of $e$, and $\beta_1$ is a fixed vector. See Lazzeroni and Little (1998) for the use of random models for post stratification. 

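The vector $w'$ that minimizes (9.6) subject to restriction (9.7) is 

$$w' = \alpha' + (\bar{x}_N - \bar{x}_\pi)\,H_{xx}^{-1}\,X'\Phi_{ee}^{-1}, \tag{9.8}$$

where 

$$H_{xx} = \begin{pmatrix} X_1'\Phi_{ee}^{-1}X_1 & X_1'\Phi_{ee}^{-1}X_2 \\ X_2'\Phi_{ee}^{-1}X_1 & \Psi^{-1} + X_2'\Phi_{ee}^{-1}X_2 \end{pmatrix}. \tag{9.9}$$

The estimator can be written 

$$\bar{y}_{reg} = \bar{y}_\pi + (\bar{x}_N - \bar{x}_\pi)\,\hat{\theta}, \tag{9.10}$$

where $\hat{\theta} = H_{xx}^{-1}X'\Phi_{ee}^{-1}y$. See Henderson (1963), Robinson 
(1991), and Rao (2002, Chapter 6). 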
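
The relaxed-constraint weights can be sketched numerically. The code below is a minimal illustration, assuming a diagonal working covariance matrix and weights of the form $w' = \alpha' + (\bar{x}_N - \alpha'X)H^{-1}X'\Phi^{-1}$, where $H$ is $X'\Phi^{-1}X$ with $\Psi^{-1}$ added to the block for the relaxed variables. The function names, the argument `psi_inv` (standing for $\Psi^{-1}$), and the toy data are hypothetical and are not taken from the paper.

```python
def solve(a, b):
    """Solve the small square linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))  # partial pivoting
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ridge_calibrate(X1, X2, alpha, phi, xbar1_N, xbar2_N, psi_inv):
    """Weights with exact calibration on X1 and penalized calibration on X2.

    phi is the diagonal of the working covariance matrix (assumed diagonal);
    psi_inv is the inverse of the penalty matrix, as a list of lists.
    """
    n, k1, k2 = len(alpha), len(xbar1_N), len(xbar2_N)
    k = k1 + k2
    X = [list(X1[i]) + list(X2[i]) for i in range(n)]
    # H = X' Phi^{-1} X with psi_inv added to the lower-right (X2, X2) block
    H = [[sum(X[i][a] * X[i][b] / phi[i] for i in range(n)) for b in range(k)]
         for a in range(k)]
    for a in range(k2):
        for b in range(k2):
            H[k1 + a][k1 + b] += psi_inv[a][b]
    target = list(xbar1_N) + list(xbar2_N)
    gap = [target[a] - sum(alpha[i] * X[i][a] for i in range(n)) for a in range(k)]
    c = solve(H, gap)  # H is symmetric, so gap H^{-1} equals (H^{-1} gap)'
    return [alpha[i] + sum(X[i][a] * c[a] for a in range(k)) / phi[i]
            for i in range(n)]


# Hypothetical data: exact control on the intercept, relaxed control on x.
xs = (1.0, 2.0, 3.0, 4.0, 5.0)
X1 = [[1.0] for _ in xs]
X2 = [[x] for x in xs]
alpha = [0.2] * 5
phi = [1.0] * 5


def fitted_x2(w):
    """Weighted mean of the relaxed auxiliary variable."""
    return sum(wi * x for wi, x in zip(w, xs))


w = ridge_calibrate(X1, X2, alpha, phi, [1.0], [3.5], [[1.0]])
fit2 = fitted_x2(w)
fit_tight = fitted_x2(ridge_calibrate(X1, X2, alpha, phi, [1.0], [3.5], [[1e-9]]))
fit_loose = fitted_x2(ridge_calibrate(X1, X2, alpha, phi, [1.0], [3.5], [[1e9]]))
print(sum(w), fit2, fit_tight, fit_loose)
```

In this toy setting the weights always sum to one, because the intercept is an exact constraint, while the weighted mean of the relaxed variable moves between the initial estimate (3.0) and the control value (3.5) as the penalty grows. A very small `psi_inv` approaches exact calibration; a very large one leaves the initial weights essentially unchanged.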

Husain (1969) considered (9.6) for a simple random sample from a normal distribution with $X_1 = J$, $\Phi_{ee} = I$, and $\Psi^{-1} = \gamma^{-1}\hat{\Sigma}_{\bar{x}\bar{x}}$, where $\hat{\Sigma}_{\bar{x}\bar{x}}$ is the estimated covariance matrix of $\bar{x}_{2,\pi}$, and $\gamma$ is a constant to be determined. For this case, Husain showed that the optimal $\gamma$ is 

$$\gamma_{opt} = [k_2(1 - R^2)]^{-1}(n - k_2 - 2)R^2, \tag{9.11}$$

where $k_2$ is the dimension of $x_2$ and $R^2$ is the squared multiple correlation coefficient. Bardsley and Chambers (1984) considered the function (9.6), the division of $x_i$ into two components, and studied the behavior of the estimator from a model perspective. The procedure associated with (9.5), (9.6) and (9.7) was used by Isaki, Tsay and Fuller (2000). In that application, the vector $\bar{x}_{1,N}$ contained marginal totals of a multiway table and $\bar{x}_{2,N}$ contained totals for interior cells. Rao and Singh (1997) studied a closely related estimator in which tolerances are given for the difference between the final estimates of the auxiliary means and the corresponding elements of $\bar{x}_N$. 

Park (2002) extended Husain's results to a more general setting. The $\bar{x}_2$ vector can be transformed so that $V\{\bar{x}_{2,\pi}\}$ for the transformed vector is a diagonal matrix and so that $\tilde{X}_2'\Phi_{ee}^{-1}\tilde{X}_2$ is a diagonal matrix, where $\tilde{X}_2$ is the part of $X_2$ that is orthogonal to $X_1$ in the metric $\Phi_{ee}^{-1}$. That is, 

$$\tilde{X}_2 = X_2 - X_1(X_1'\Phi_{ee}^{-1}X_1)^{-1}X_1'\Phi_{ee}^{-1}X_2.$$

Then the diagonal $\Psi$ that minimizes the approximate variance has elements 

$$\psi_{ii} = m_{ii}^{-1}\,\sigma_{\beta,i}^{-2}\,\beta_i^{2}, \tag{9.12}$$

where $m_{ii}$ is the $i$th element of the diagonal matrix $\tilde{X}_2'\Phi_{ee}^{-1}\tilde{X}_2$ and $\sigma_{\beta,i}^{2}$ is the variance of $\hat{\beta}_i$ in the transformed scale. To implement the procedure one must estimate the population parameters or choose realistic values for a general purpose $\Psi$. If one postulates a superpopulation random model for $\beta$, then the $\beta_i^2$ of (9.12) is replaced with $E\{\beta_i^2\}$, where the expectation is the model expectation. 


10. COMMENTS 


Regression estimation is a flexible and powerful tool for 
the incorporation of auxiliary information into the esti- 
mation process. Closely related procedures, such as raking 
ratio, have large sample properties equivalent to those of 
regression estimators. The linearity of such estimators is of 
paramount importance because it permits the construction 
of a general purpose data set that provides very good 
estimators for a wide range of parameters. 

Given a concentrated interest in a single y-variable, 
efficiency gains may be possible by postulating a particular 
set of auxiliary variables and a particular error covariance 
matrix. Because of the simple nature of the design 
consistency requirement, it is easy to test such models for 
design consistency. 
APPENDIX 


This appendix contains theorems supporting the limiting 
properties of the regression estimators discussed in section 4. 


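Theorem A.1. Let $\{(F_N, A_N): N = k+3, k+4, \ldots\}$ be a sequence of finite populations and samples, where $F_N$ is a sample from an infinite population with eighth moments, and $A_N$ is the sample of size $n_N$ selected from the $N$th population. Let $\hat{\beta}$ be defined by (4.4) of the text, and let 

$$\hat{Q}_{zz} = n^{-1}Z'\Phi^{-1}Z,$$

where $\Phi$ is a positive definite symmetric $n \times n$ matrix that may be a function of $X$ but not of $y$, $Z$ is defined following (4.2), and we omit the subscript $N$ on sample quantities. Assume $\hat{Q}_{zz}$ is positive definite with probability one. If $\Phi$ is random, assume the rows of $\Phi^{-1}Z$ have bounded fourth moments. Assume the selection probabilities satisfy 

$$0 < K_L < N n^{-1}\pi_i < K_U,$$

where $\pi_i$ are the selection probabilities. Assume the sample design is such that for any $z$ with bounded fourth moments 

$$\{(\bar{z}_\pi - \bar{z}_N),\ (\hat{Q}_{zz} - Q_{zzN})\} \mid F_N = O_p(n^{-1/2}), \tag{A.1}$$

where 

$$\bar{z}_\pi = (\bar{y}_\pi, \bar{x}_\pi) = N^{-1}\sum_{i \in A} \pi_i^{-1} z_i, \tag{A.2}$$

$Q_{zzN} = E\{\hat{Q}_{zz} \mid F_N\}$, $\bar{z}_N$ is the finite population mean of $z$, $Q_{zzN}$ is a positive definite matrix for the $N$th population, and the limit of $Q_{zzN}$ is positive definite. Then 

$$(\hat{\beta} - \beta_N) \mid F_N = Q_{xxN}^{-1}\,\bar{b}_\pi' + O_p(n^{-1}), \tag{A.3}$$

where $\beta_N = Q_{xxN}^{-1}Q_{xyN}$, $\bar{b}_\pi' = N^{-1}\sum_{i \in A}\pi_i^{-1}b_i'$, $b_i' = n^{-1}N\pi_i\zeta_i'e_i$, 

$$Q_{zzN} = \begin{pmatrix} Q_{yyN} & Q_{yxN} \\ Q_{xyN} & Q_{xxN} \end{pmatrix}, \tag{A.4}$$

$e_i = y_i - x_i\beta_N$, and $\zeta_i'$ is column $i$ of $X'\Phi^{-1}$. Assume the design is such that 

$$V_{\bar{z}\bar{z}}^{-1/2}\,(\bar{z}_\pi - \bar{z}_N) \mid F_N \xrightarrow{L} N(0, I), \tag{A.5}$$

as $N \to \infty$ for any $z$ with finite fourth moments, where $V_{\bar{z}\bar{z}}$ is the covariance matrix of $\bar{z}_\pi - \bar{z}_N$. Assume that $V_{\bar{z}\bar{z}}$ is $O(n^{-1})$ and that the design admits an estimator $\hat{V}_{\bar{z}\bar{z}}$ such that 

$$n(\hat{V}_{\bar{z}\bar{z}} - V_{\bar{z}\bar{z}}) \mid F_N = o_p(1) \tag{A.6}$$

for any $z$ with bounded fourth moments. Then 

$$[\hat{V}\{\hat{\beta}\}]^{-1/2}(\hat{\beta} - \beta_N) \mid F_N \xrightarrow{L} N(0, I), \tag{A.7}$$

where 

$$\hat{V}\{\hat{\beta}\} = \hat{Q}_{xx}^{-1}\,\hat{V}_{\bar{b}\bar{b}}\,\hat{Q}_{xx}^{-1}, \tag{A.8}$$

$\hat{V}_{\bar{b}\bar{b}} = \hat{V}\{\bar{b}_\pi'\}$ is the estimated design variance of $\bar{b}_\pi$ calculated with $\hat{b}_i' = n^{-1}N\pi_i\zeta_i'\hat{e}_i$ and $\hat{e}_i = y_i - x_i\hat{\beta}$. 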


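Proof. The error in $\hat{\beta}$ is 

$$\hat{\beta} - \beta_N = (X'\Phi^{-1}X)^{-1}[X'\Phi^{-1}y - X'\Phi^{-1}X\beta_N] = \hat{Q}_{xx}^{-1}(n^{-1}X'\Phi^{-1}e).$$

Now $\hat{\beta}$ is a generalized least squares estimator. Therefore 

$$\hat{e}'\Phi^{-1}X = (y - X\hat{\beta})'\Phi^{-1}X = 0$$

and $Q_{xyN} - Q_{xxN}\beta_N = Q_{xeN} = 0$. By assumption (A.1) 

$$\hat{Q}_{xe} = n^{-1}X'\Phi^{-1}e = O_p(n^{-1/2}).$$

Thus 

$$\hat{\beta} - \beta_N = Q_{xxN}^{-1}\,N^{-1}\sum_{i \in A}\pi_i^{-1}b_i' + O_p(n^{-1}) = Q_{xxN}^{-1}\,\bar{b}_\pi' + O_p(n^{-1}).$$

The $b_i$ have bounded fourth moments by the assumptions. Thus, by assumption (A.5) 

$$V_{\beta\beta}^{-1/2}(\hat{\beta} - \beta_N) \xrightarrow{L} N(0, I),$$

where 

$$V_{\beta\beta} = Q_{xxN}^{-1}\,V_{\bar{b}\bar{b}}\,Q_{xxN}^{-1}$$

and $V_{\bar{b}\bar{b}} = V\{\bar{b}_\pi' \mid F_N\}$. Now 

$$n^{-1}X'\Phi^{-1}\hat{e} = n^{-1}X'\Phi^{-1}e + n^{-1}X'\Phi^{-1}X(\beta_N - \hat{\beta}) = N^{-1}\sum_{i \in A}\pi_i^{-1}b_i' + N^{-1}\sum_{i \in A}\pi_i^{-1}h_i',$$

where $h_i' = n^{-1}N\pi_i\zeta_i'x_i\hat{\delta}$ and $\hat{\delta} = \beta_N - \hat{\beta}$. For any fixed $\delta$, by (A.6), the estimated variance of $N^{-1}\sum_{i \in A}\pi_i^{-1}(b_i + h_i)$ is consistent for the variance of the estimator of the mean of $b + h$. By assumption, the elements of $\zeta_i'x_i$ have fourth moments. For a fixed $\delta$ the variance of $\bar{h}_\pi$ is $O(n^{-1})$. For $\delta = \hat{\delta}$ 

$$\hat{V}\{\bar{h}_\pi\} = O_p(n^{-2})$$

and 

$$\hat{V}\{\bar{b}_\pi + \bar{h}_\pi\} = \hat{V}\{\bar{b}_\pi\} + o_p(n^{-1})$$

because $\hat{\delta} = O_p(n^{-1/2})$. Result (A.7) then follows from the asymptotic normality of $\hat{\beta} - \beta_N$. 
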
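Theorem A.2. Let $y = (y_1, y_2, \ldots, y_n)'$ and $X' = (x_1', x_2', \ldots, x_n')$. Let $\Phi$ be a nonsingular symmetric $n \times n$ matrix and let $\Phi_N$ be a nonsingular symmetric $N \times N$ matrix. Let 

$$\bar{y}_\pi,\ \bar{x}_\pi,\ n^{-1}(X'\Phi^{-1}X) \ \text{and} \ n^{-1}X'\Phi^{-1}y$$

be design consistent estimators for finite population characteristics $\bar{y}_N$, $\bar{x}_N$, $Q_{xxN}$ and $Q_{xyN}$, respectively, where 

$$(Q_{xxN}, Q_{xyN}) = [N^{-1}X_N'\Phi_N^{-1}X_N,\ N^{-1}X_N'\Phi_N^{-1}y_N]. \tag{A.9}$$

Let $\beta_N = Q_{xxN}^{-1}Q_{xyN}$. Let there be a sequence of column vectors $\{\gamma_N\}$ such that 

$$X\gamma_N = \Phi D_\pi^{-1}J \tag{A.10}$$

for all possible samples, where $D_\pi = \mathrm{diag}(\pi_1, \pi_2, \ldots, \pi_n)$ and $J$ is an $n$-dimensional column vector of ones. Then, the regression estimator $\bar{x}_N\hat{\beta}$ with 

$$\hat{\beta} = (X'\Phi^{-1}X)^{-1}X'\Phi^{-1}y, \tag{A.11}$$

is a design consistent estimator of $\bar{y}_N$. 

Proof. If $\hat{\beta}$ is defined by (A.11), then by the properties of generalized least squares estimators, 

$$(y - X\hat{\beta})'\Phi^{-1}X = 0.$$

If (A.10) holds, then 

$$(y - X\hat{\beta})'D_\pi^{-1}J = \sum_{i \in A}\pi_i^{-1}(y_i - x_i\hat{\beta}) = 0.$$

It follows that $\bar{x}_N\hat{\beta}$ is design consistent because 

$$0 = \mathrm{plim}\,\{(\bar{y}_\pi - \bar{x}_\pi\hat{\beta}) \mid F_N\} = \mathrm{plim}\,\{(\bar{y}_\pi - \bar{x}_\pi\beta_N) \mid F_N\} = \mathrm{plim}\,\{(\bar{y}_N - \bar{x}_N\beta_N) \mid F_N\}.$$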


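Theorem A.3. Let a sequence of populations and samples be as defined in Theorem A.1. Let $z_i$ be a vector of the form $z_i = (y_i, 1, x_{1,i})$ and let $z_{1,i} = (y_i, x_{1,i})$. Assume $\bar{z}_{1,\pi}$ is a design consistent estimator of the population mean $\bar{z}_{1,N}$ with nonsingular covariance matrix 

$$V\{\bar{z}_{1,\pi} \mid F_N\} = O(n^{-1}) \tag{A.12}$$

and 

$$n^{1/2}(\bar{z}_{1,\pi} - \bar{z}_{1,N}) \mid F_N \xrightarrow{L} N(0, \Sigma_{zz}), \tag{A.13}$$

where $\Sigma_{zz}$ is the limit of $nV\{\bar{z}_{1,\pi} \mid F_N\}$. Assume there is an estimator of the variance of $\bar{z}_{1,\pi}$, denoted by $\hat{V}\{\bar{z}_{1,\pi}\}$, such that 

$$\mathrm{plim}\ n^{1+\delta}\,(\hat{V}\{\bar{z}_{1,\pi}\} - V\{\bar{z}_{1,\pi} \mid F_N\}) = 0 \tag{A.14}$$

for some $\delta > 0$. Let $\beta_{1,dopt}$ be the vector that minimizes 

$$V\{\bar{y}_\pi - \bar{x}_{1,\pi}\beta_1 \mid F_N\}, \tag{A.15}$$

and let $\hat{\beta}_{1,dopt}$ be the vector that minimizes $\hat{V}\{\bar{y}_\pi - \bar{x}_{1,\pi}\beta_1\}$. Let $\bar{y}_{d,reg}$ be defined by (4.29). Then $\bar{y}_{d,reg}$ has the minimum limit variance for design consistent estimators of the form $\bar{y}_\pi + (\bar{x}_{1,N} - \bar{x}_{1,\pi})\hat{\beta}_1$. Also 

$$[\hat{V}\{\bar{e}_\pi\}]^{-1/2}(\bar{y}_{d,reg} - \bar{y}_N) \xrightarrow{L} N(0, 1), \tag{A.16}$$

where $\hat{V}\{\bar{e}_\pi\}$ is the estimator of (A.14) constructed with $\hat{e}_i = y_i - \bar{y}_\pi - (x_{1,i} - \bar{x}_{1,\pi})\hat{\beta}_{1,dopt}$. 

Proof. The estimator 

$$\hat{\beta}_{1,dopt} = [\hat{V}\{\bar{x}_{1,\pi}\}]^{-1}\,\hat{C}\{\bar{x}_{1,\pi}', \bar{y}_\pi\}$$

minimizes the estimated variance of (A.15), and, by assumption (A.14), the estimated variance is consistent for the true variance. Hence, $\hat{\beta}_{1,dopt}$ is design consistent for $\beta_{1,dopt}$, and $\beta_{1,dopt}$ minimizes $V\{\bar{y}_\pi - \bar{x}_{1,\pi}\beta \mid F_N\}$. Therefore, no estimator of the form (4.29) has a limiting distribution with smaller variance. 

Now 

$$\bar{y}_{d,reg} - \bar{y}_N = \bar{y}_\pi - \bar{y}_N + (\bar{x}_{1,N} - \bar{x}_{1,\pi})\beta_{1,dopt} + O_p(n^{-1/2-\delta}) = \bar{e}_\pi - \bar{e}_N + O_p(n^{-1/2-\delta}),$$

where $e_i = y_i - \bar{y}_N - (x_{1,i} - \bar{x}_{1,N})\beta_{1,dopt}$. Therefore the variance of the limiting distribution of $n^{1/2}(\bar{y}_{d,reg} - \bar{y}_N)$ is the variance of $n^{1/2}(\bar{e}_\pi - \bar{e}_N)$. By assumption (A.14), the estimator $\hat{V}\{\bar{z}_{1,\pi}\gamma\}$ is a consistent variance estimator of $V\{\bar{z}_{1,\pi}\gamma \mid F_N\}$ for any fixed $\gamma$. Because $\hat{\beta}_{1,dopt} - \beta_{1,dopt} = O_p(n^{-\delta})$, the estimated variance based on $\hat{e}_i$ converges to the estimated variance based on $e_i$ and (A.16) holds. 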
Leslie Kish’s Impact on Survey Statistics 


GRAHAM KALTON¹ 


ABSTRACT 


Leslie Kish, one of the pioneers of survey sampling, died on October 7, 2000, at the age of 90. This paper reviews his impact 
on survey statistics, mainly in terms of his research but also in terms of his promotion of sound probability sampling 
methods around the world. Kish’s research was broad-ranging, covering sampling methods, variance estimation and design 
effects, nonsampling errors, small area estimation, survey designs across time and space, and observational studies. He 
promoted probability sampling designs through consultancies in many countries, his writings, and in particular through the 
highly effective intensive summer Sampling Program for Foreign Statisticians that he established at the Survey Research 
Center of the University of Michigan. 


KEY WORDS: Sample design; Variance estimation; Nonsampling errors; Rolling samples. 


1. INTRODUCTION 


Leslie Kish, one of the pioneers of survey sampling, died 
on October 7, 2000, at the age of 90. During his long and 
productive career, he had a major impact on the field, 
achieved both through his impressive research contributions 
and through his extremely successful promotion of the use 
of scientific probability sampling methods throughout the 
world, and especially in developing countries. His wide- 
ranging research always focused on issues of practical 
importance, and his innovations facilitated the use of effec- 
tive probability sampling in diverse areas. He promoted the 
practice of probability sampling through his expository 
writings (particularly for sociologists and demographers), 
through his numerous consultancies and advisory services, 
and through his training of survey statisticians, particularly 
those from developing countries. 

This paper reviews Kish’s impact on survey statistics, 
primarily with respect to his contributions to the advance- 
ment of survey sampling and survey research more gene- 
rally. It is useful to start with a brief account of his career in 
order to place these contributions in a temporal context. The 
interview of Kish in 1994 by Frankel and King (1996) is 
recommended for those interested in more details of Kish’s 
fascinating life. Some of the material in this paper is drawn 
from that interview. 

Kish was born in 1910 in Poprad, which was then part of 
the Austro-Hungarian Empire and is now in Slovakia. In 
1926, he emigrated to the United States with his family. 
When his father died the following year, he became a 
laboratory assistant at the Rockefeller Institute for Medical 
Research, while attending Bay Ridge Evening High School. 
He graduated from high school in 1930 and enrolled in the 
College of the City of New York night school, while 
continuing to work for 54 hours a week at the Rockefeller 
Institute. His interest in statistics arose out of his work at 
the Institute, and he studied on his own books by Fisher, 
Yule, Wallace and Snedecor, Tippett, Pearl, and others. In 
1937, he interrupted his education to join the International 
Brigade to fight for the Loyalist cause in the Spanish Civil 
War. He returned to the United States in 1939 and earned 
a B.S. in Mathematics, cum laude, in that year. He was then 
hired by the U.S. Census Bureau as a Section Head, and 
subsequently moved to be a Statistician at the United States 
Department of Agriculture (USDA) Division of Program 
Surveys. In 1942, he left the Division of Program Surveys 
for war service, returning there in 1945 after the war. In 
1947, he moved with a group of USDA colleagues headed 
by Rensis Likert to set up the Survey Research Center at the 
University of Michigan. He remained at the Survey 
Research Center until his retirement in 1981, when he 
became a Professor Emeritus. He remained fully active 
professionally until his death in 2000. 


2. RESEARCH 


At the start of Kish’s career, survey sampling was in its 
infancy. Much survey research was based on nonprobability 
samples. Methods for probability sampling were under 
development and many problems remained to be resolved. 
While at the USDA, Kish identified three important 
problems that he pursued at the Survey Research Center 
(SRC) in developing sampling methods there. 

One of these problems was how to have an interviewer 
randomly select an individual within a sampled household. 
At the time, probability sampling methods for sampling 
households had been developed and were being applied in 
the Current Population Survey, but the CPS collected data 
on all members of sampled households, so that no selection 
of persons within households was needed. Kish invented a 
method for objective respondent selection and wrote it up 
in a memorandum. He was urged by his colleague Angus 
Campbell to submit the work for publication, and it resulted 
in the famous paper that was his first published research 
(Kish 1949). The widely used method is now known as the 
Kish selection table. 

The second problem that Kish identified was counting 
nonresponse. He had to argue for counting and reporting 
nonresponse with probability samples against the concerns 
of colleagues who felt that to do so would put the SRC at a 
competitive disadvantage, particularly with organizations 
using nonprobability methods. He won his case and SRC 
adopted his approach, which is now fully accepted as 
standard good practice. 

The third problem was that of deep stratification. 
Standard stratification assumes independence of selections 
between strata, with the maximum number of strata possible 
being the number of selections. Particularly when the 
number of selections is small, as is often the case with the 
primary sampling units (PSUs) in a multistage design, it can 
be desirable to obtain greater balance in the sample than 
standard stratification permits. With Roe Goodman, Kish 
developed the technique of controlled selection that 
provides that greater balance by dropping the requirement 
of independence of selections between strata, while still 
retaining probability sampling (Goodman and Kish 1950). 
Kish, who was always concerned to coin good names, 
preferred to call the technique ‘multiple stratification’, and 
he uses that term in his sampling text (Kish 1965a). 

Kish’s subsequent research in survey statistics was 
wide-ranging, covering many aspects of survey sampling, 
nonsampling errors, small area estimation, survey designs 
across time and space, and observational studies. His many 
contributions have had a major impact on the development 
of the practice of survey sampling and of survey research 
more generally. The following paragraphs outline some of 
his contributions organized by topic. 

Variance estimation. Before the 1970s, the analysis of 
survey data was severely limited by the analytic tools 
available, then mostly punch card equipment, such as 
counter-sorters and tabulators, and hand calculators. Thus, 
for example, weights — and particularly non-integer weights 
— were difficult to handle. For this reason Kish examined 
the use of uniform weights with the Kish selection table, 
even though unbiased estimation calls for weights propor- 
tional to the number of eligible household members. 

As a result of the computational difficulties, prior to the 
1970s sampling errors were rarely computed in a manner 
that reflected the complex sample designs typically 
employed in survey research. A widespread practice was to 
compute variances as if a simple random sample (SRS) had 
been drawn. Kish sought to promote the use of appropriate 
variance estimation methods by social researchers, which 
he did by illustrating the sizable underestimation that often 
arises when SRS formulas are applied to clustered samples 
(Kish 1957). Initially he developed and applied simple 
computational procedures, emphasizing the simplicity that 
can be obtained with a paired selection design in which two 
PSUs are sampled in each stratum (Kish and Hess 1959a; 
Kish 1968). He coined the term “design effect” for the ratio 
of the variance of a survey estimate for a given design to the 
variance of the same estimate obtained from a simple 
random sample of the same size. He made much use of this 
concept in his famous Survey Sampling book (Kish 1965a), 
which provides an encyclopedic treatment of practical 
survey sampling and is still widely read as a Wiley classic. 
He retained his interest in design effects throughout his 
career as an important tool in the design and analysis of 
survey samples (see, for example, Kish 1982, 1995a; Kish, 
Frankel, Verma and Kaciroti 1995; Kish, Groves and Krotki 
1976). An important term in the design effect for a clustered 
sample is the intra-class correlation, which is featured in 
Kish’s Ph.D. dissertation (Kish 1952) and in a number of 
his other papers (e.g., Kish 1954, 1961a). 
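
The design effect defined above can be illustrated with a short numerical sketch. The ratio itself follows the definition in the text. The cluster formula `1 + (b - 1) * rho`, for clusters of equal size `b` with intra-class correlation `rho`, is the standard textbook approximation and is not stated in this paper, and the numbers used are hypothetical.

```python
def design_effect(var_design, var_srs):
    """Ratio of the variance under the actual design to the variance
    under a simple random sample of the same size."""
    return var_design / var_srs


def cluster_design_effect(cluster_size, rho):
    """Standard approximation for equal-sized clusters (an assumption of
    this sketch): deff = 1 + (b - 1) * rho."""
    return 1.0 + (cluster_size - 1.0) * rho


def effective_sample_size(n, deff):
    """Size of the simple random sample that would give the same variance."""
    return n / deff


# Hypothetical survey: 1000 interviews in clusters of 10, rho = 0.05.
deff = cluster_design_effect(10, 0.05)
n_eff = effective_sample_size(1000, deff)
print(deff, n_eff)
```

Even a small intra-class correlation inflates the variance noticeably when clusters are large, which is why variances computed as if the sample were a simple random sample tend to be too small for clustered designs.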

With the development of computers, Kish was quick to 
see their importance for variance estimation, and with SRC 
colleagues he developed an early Sampling Error Program 
Package (Kish, Frankel and Van Eck 1972). With his 
doctoral student Martin Frankel, he also extended the range 
of statistics for which sampling errors from complex sample 
designs could be computed (Kish and Frankel 1970, 1974). 
This highly influential research developed, applied, and 
evaluated balanced repeated replication (BRR) and jack- 
knife repeated replication (JRR) methods of variance esti- 
mation. It also provided a definition of the population 
parameters estimated by analytical survey statistics in the 
finite population context. 

Multipurpose surveys. The survey sampling literature 
deals mostly with an efficient sample design for estimating 
a single population parameter. Kish recognized the limita- 
tion of this approach since virtually all surveys are multi- 
purpose in nature. He wrote several important papers 
dealing with multipurpose surveys, producing effective 
compromise designs that provide estimates not only for the 
population as a whole but also for various domains (Kish 
1961b, 1969, 1976; Anderson, Kish and Cornell 1976; Kish 
and Anderson 1978; Kish 1980; Kish 1988). In recent years, 
he extended his interests to multipopulation surveys (e.g., 
Kish 1999, 2002). 

Small area estimation. In considering the production of 
estimates for domains, Kish (1980, 1987a, 1987b) classified 
domains into major, minor, and mini domains and rare 
items. Estimates for major domains can be produced from 
a survey using standard sample-based estimators, particu- 
larly if the sample is designed to give sufficient domain 
sample sizes for this purpose. The sample sizes of most 
surveys preclude the production of estimates of adequate 
precision for minor or mini domains that comprise less than, 
say, one-tenth of the population. Yet, as Kish recognized 
early on, the demand for up-to-date estimates for small 
domains, particularly small geographical areas, would 
expand. This recognition led to his research in two related 
areas. 

When a survey’s sample size is too small to produce 
small area sample-based estimates of adequate precision, 
reliance may be placed on statistical models to produce 
indirect estimates. Much research on small area estimation 
techniques using this model-dependent approach has been 
conducted in recent years. In the 1970s, Kish contributed 
to the development of the field through his direction of 
three doctoral dissertations at the University of Michigan 
(Ericksen 1973; Kalsbeek 1973; Purcell and Kish 1979, 
1980). 

Direct, or sample-based, estimates for small domains are 
sometimes possible. One obvious source of estimates for 
domains of any size is a population census, and indeed 
censuses are a major source of small domain estimates. 
However, data from a decennial census become out-of-date 
as the decade progresses. To address this problem, Kish 
proposed replacing the census by a rotating or rolling 
sample so that, by spreading the data collection over time, 
more up-to-date estimates can be produced. He first 
proposed such a procedure in 1979 (Kish 1979a,b), and 
wrote many papers on this topic after that (Kish 1981, 1983, 
1986, 1990, 1997, 1998, 2002; Kish and Verma 1986), 
including the issue of how to cumulate sample data over 
time (Kish 1999). In another paper in this volume, Charles 
Alexander (2002) provides a detailed review of Kish’s work 
on this topic and its influence on the large-scale continuous 
survey, the American Community Survey, that the U.S. 
Census Bureau plans to introduce to replace the long form 
in the 2010 Census. 

Special sample design problems. During the course of 
his work, Kish encountered a number of specialized 
sampling problems that often occur and he offered some 
efficient solutions. The areas to which he contributed 
include the following: 


— Sampling rare and elusive populations. One of the 
most challenging design tasks faced by sampling 
statisticians is constructing an efficient sample 
design for a rare or elusive population (such as 
persons with a rare illness or the homeless). Kish 
(1965b, 1991) provides insightful reviews of 
methods for tackling this type of problem. 


— Maximizing overlap. When a population is 
sampled repeatedly over time, the issue arises of 
how to control the sample overlap between one 
round and the next. A particular example occurs 
when a master sample of PSUs is used and needs 
to be updated when new census data become 
available. Frequently it is desirable to maximize 
the overlap in the sample of PSUs, while updating 
measures of size and changing the stratification to 
reflect current data. Kish and Scott (1971) provide 
a relatively simple and effective method of 
satisfying these requirements. 


— Sampling organizations of unequal size. Some 
surveys are designed to produce estimates for units 
at different levels, for instance, for hospitals and 
for patients. When hospitals vary considerably in 
their numbers of patients, a design conflict arises 
between the production of efficient hospital- and 
patient-level estimates. Kish (1965c) examines this 
problem and clarifies the issues involved. 


Nonsampling errors. Kish clearly recognized the harmful 
effects that nonsampling errors can have on the quality of 
survey estimates. Early in his career he collaborated with 
Jack Lansing to investigate the response errors in 
respondents’ reports of the values of their homes by 
comparing these reports with estimates made by 
professional appraisers (Kish and Lansing 1954). In his 
studies of interviewer variance, he took advantage of the 
theory on cluster sampling, measuring interviewer variance 
with the intra-class correlation coefficient, and determining 
the optimum number of interviews per interviewer based on 
a simple cluster sample cost model (Kish 1962). With Irene 
Hess, he conducted a study of noncoverage in area samples 
of dwelling units. The study was stimulated by a 10 percent 
noncoverage rate in SRC surveys at that time, and led to 
improvements that reduced this rate to about 3 percent 
(Kish and Hess 1958). Also with Irene Hess, he introduced 
an imaginative replacement procedure for noncontacts in 
one survey by substituting noncontacts from a previous, 
similar, survey (Kish and Hess 1959b). For stochastic 
imputation schemes, Kish was an early proponent of 
replicating the imputations to reduce imputation variance, 
in what he termed a repeated replication imputation 
procedure (RRIP) and what is now known as fractional 
imputation (Kalton and Kish 1984). 

Observational studies. Early in his career, Kish (1959) 
wrote a widely cited paper on the design of studies to 
investigate causal relationships, particularly nonrandomized 
studies. In his writing about this topic he made use of his 
survey sampling expertise as, for instance, in the relation- 
ship between stratification and matching (Anderson, Kish 
and Cornell 1980). His work developed into his book 
Statistical Design for Research (Kish 1987a) in which he 
compared surveys, experiments, and observational studies 
for investigating causal effects in terms of the three R’s: 
realism, randomization and representativeness (see also 
Kish 1975). He also made clear the importance of assessing 
both bias and variance in assessing the ability of different 
study designs to measure causal effects, rather than concen- 
trating on bias as had been common in the literature on this 
topic. 


3. OTHER CONTRIBUTIONS 


Kish’s seminal and wide-ranging contributions to the 
methodology of survey statistics are of great importance. 
Yet of possibly even greater importance are his contri- 
butions to the promotion of the use of sound probability 
sampling methods around the world. 
Kish’s writings, of course, contributed to the current 
widespread use of probability sampling methods by 
emphasizing good practical methods. His three books 
Survey Sampling (Kish 1965a), Statistical Design for 
Research (Kish 1987a), and Sampling Methods for 
Agricultural Surveys (Kish 1989) are all extremely valuable 
in this respect, as are his expository writings for social 
scientists. 

Kish had a long-standing dedication to assisting deve- 
loping and transition countries, and that can be seen in 
many of his activities. He was a sampling consultant to the 
World Fertility Survey from 1973 to 1983 and he consulted 
in many countries, he ran a training program for foreign 
statisticians, and he wrote specifically for statisticians in 
developing countries. Sampling Methods for Agricultural 
Surveys was, for instance, written for the FAO, particularly 
for use in developing countries. He contributed a 
Questions/Answers column for the Survey Statistician, the 
newsletter of the International Association of Survey 
Statisticians, from 1978 to 1994. In that column he provided 
sound advice on many practical sampling problems that 
frequently arise but that are not well addressed in the 
literature. The column was considered so useful that the 
IASS published the full set of questions and answers in a 
special volume (Kish 1995b). 

Kish was rightly particularly proud of the intensive 
two-month summer Sampling Program for Foreign 
Statisticians that he established at the Survey Research 
Center in 1961. The SPFS has now trained more than 500 
survey statisticians from 105 countries. It is significant that 
Kish chose “Developing samplers for developing countries” 
as the topic for his 1994 Morris Hansen Memorial Lecture 
(Kish 1996). To help maintain this important program, the 
Leslie Kish International Fellows Fund was established at 
the University of Michigan at a celebration of Kish’s 90th 
birthday. Of all his accomplishments, the SPFS was the one 
that gave him greatest pleasure. 


4. CONCLUDING REMARKS 


Leslie Kish is a giant in the field of survey sampling. His 
contributions were enormous and recognized by many 
honors. These honors included, among others, President of 
the International Association of Survey Statisticians in 
1983-85, President of the American Statistical Association 
in 1978 (see Kish 1978, for his Presidential address on 
“Chance, Statistics and Statisticians”), Honorary Fellow of 
the International Statistical Institute, Honorary Fellow of 
the Royal Statistical Society, Honorary Member of the 
Hungarian Academy of Sciences, Fellow of the American 
Association for the Advancement of Science, Fellow of the 
American Academy of Arts and Sciences, recipient of the 
American Statistical Association's Samuel L. Wilks Award 
in 1997, recipient of the Mindel Sheps Award from the 
Population Association of America in 1998, recipient of the 
Methodology Award from the American Sociological 
Association in 1989, and honorary degrees from the 
University of Bologna, the Athens University of Economics 
and Business, and the Eötvös Loránd University in 
Budapest. 

Yet Kish remained down-to-earth, approachable by all. 
He had a great enthusiasm for many subjects including 
sport, art, literature, politics, philosophy, and science. He 
was always concerned with improving the conditions of the 
world’s population. He was particularly interested in young 
people and one of his favorite sayings was “Keep young by 
being curious, and have young friends”. Undoubtedly his 
endearing personality played an important part in his great 
success in promoting sound sampling methods around the 
world. Ivan Fellegi’s excellent obituary in Survey 
Methodology was aptly titled “Leslie Kish — A Life of 
Giving” (Fellegi 2000). Kish gave so much personally to so 
many people and so much professionally to the develop- 
ment of survey statistics. 
New Paradigms (Models) for Probability Sampling


LESLIE KISH


1. STATISTICS AS A NEW PARADIGM 


In several sections I discuss new concepts in diverse 
aspects of sampling, but I feel uncertain whether to call 
them new paradigms or new models or just new methods. 
Because of my uncertainty and lack of self-confidence, I 
ask the readers to choose that term with which they are most 
comfortable. I prefer to remove the choice of that term 
from becoming an obstacle to our mutual understanding. 

Sampling is a branch of and a tool for statistics, and the 
field of statistics was founded as a new paradigm in 1810 
by Quetelet (Porter 1987; Stigler 1986). This was later than 
the arrival of some sciences: of astronomy, of chemistry, of 
physics. “At the end of the seventeenth century the 
philosophical studies of cause and chance...began to move 
close together... During the eighteenth and nineteenth 
centuries the realization grew continually stronger that 
aggregates of events may obey laws even when individuals 
do not.” (Kendall 1968). The predictable, meaningful, and 
useful regularities in the behavior of population aggregates 
of unpredictable individuals were named “statistics” and 
were a great discovery. 

Thus Quetelet and others computed national (and other) 
birth rates, death rates, suicide rates, homicide rates, 
insurance rates, etc. from individual events that are unpre- 
dictable. These statistics are basic to fields like demography 
and sociology. Furthermore, the ideas of statistics were 
taken later during the nineteenth century also into biology 
by Francis Galton and Karl Pearson, and into physics by
Maxwell, and were developed greatly both in theory and 
applications. 

Statistics and statisticians deal with the effects of chance 
events on empirical data. The mathematics of chance had 
been developed centuries earlier for gambling games and 
for errors of observation in astronomy. Also data have been 
compiled for commerce, banking, and government. But 
combining chance with real data needs a new theoretical 
view, a new paradigm. Thus statistical science and its 
various branches arrived late in history and in academia, 
and they are products of the maturity of human 
development (Kish 1985). 

The populations of random individuals comprise the 
most basic concept of statistics. It provides the foundation 
for distribution theories, inferences, sampling theory, 
experimental design, etc. And the statistics paradigm differs 
fundamentally from the deterministic outlook of cause and 
effect, and of precise relations in the other sciences and 
mathematics. 


2. THE PARADIGM OF SAMPLING 


The Representative Method is the title of an important 
monograph, almost a century after the birth of statistics and 
over a century ago now, which is generally accepted as the 
birth of modern sampling (Kiaer 1895). That term has been 
used in several landmark papers since then (Jensen 1926; 
Neyman 1934; Kruskal and Mosteller 1979a, 1979b, 1979c, 
1980). The last authors agree that the term “representative” 
has been used for so many specific methods and with so 
many meanings that it does not denote any single method. 
However, as Kiaer used it, and as it is still used generally, 
it refers to the aims of selecting a sample to represent a 
population specified in space, in time, and by other 
definitions, in order to make statistical inferences from the 
sample to that specified population. Thus a national 
representative sample demands careful operations for 
selecting the sample from all elements of the national 
population, not only from some arbitrary domain such as a 
“typical” city or province, or from some subset, either 
defined or undefined. 

The scientifically accepted method for survey sampling 
is probability sampling, which assures known positive 
probabilities of selection for every element in the frame 
population. The frame provides the equivalent of listings of 
sampling units for each stage of selection. The sampling 
frame for the entire population is needed for mechanical 
operations of random selection. This is the basis for 
statistical inferences from the sample statistics to the 
corresponding population statistics (parameters) (Hansen, 
Hurwitz and Madow 1953a, 1953b). This insistence on 
inferences based on selections from frame populations is a 
different paradigm from the unspecified or model based 
approaches of most statistical analyses. 

It took a half century from Kiaer’s paper to the wide 
acceptance of survey sampling. In addition to neglect and 
passive resistance, there was a great deal of active 
opposition by national statistical offices which distrusted 
sampling methods to replace the complete counts of 
censuses. Some even preferred the “monograph method,” 
which offered complete counts of a “typical” or 
“representative” province or district instead of a randomly
selected national sample (O’Muircheartaigh and Wong
1981). In addition to political opposition, there were also 
many opponents among academic disciplines, and among 
academic statisticians. The tide in favor of probability 
sampling turned with the report of the UN Statistical 
Commission led by Mahalanobis and Yates (United Nations
Statistical Office 1950). Five influential textbooks between
1949 and 1954 started a flood of articles with both theory 
and wide applications. 

The strength, the breadth, and the duration of resistance 
to the concepts and use of probability sampling of frame 
populations imply that this was a new paradigm that
needed a new outlook both by the public and the 
professionals. 


3. COMPLEX POPULATIONS 


The need for strict probability selection from a 
population frame for inferences from the sample to a finite 
population is but one distinction of survey sampling. But 
even more important and difficult problems are caused by 
the complex distributions of the elements in all the popu- 
lations. These complexities present a great contrast with the 
simple model of independence that is assumed, explicitly or 
implicitly, by almost all statistical theory, all mathematical 
statistics.

The assumption of independent or uncorrelated obser-
vations of variables or elements underlies mathematical
statistics and distribution theory. We need not distinguish
here between independently and identically distributed
(IID) random variables and “exchangeability,” and
“superpopulations.” The simplicity underlying each of 
those models is necessary for the complexities of the 
mathematical developments. 

Simple models are needed and used for early stages and 
introductions in all the sciences: for example, perfect 
circular paths for the planets or d = gt²/2 for freely
dropping objects in frictionless situations. But those models 
fail to meet the complexities of the actual physical world. 
Similarly, independence of elements does not exist in any 
population whether human, animal, plant, physical, 
chemical, biological. The simple independent models may 
serve well enough for small samples; and the Poisson 
distribution of deaths by horsekicks in the Prussian Army in 
43 years has often served as an example (precious because 
rare) (Fisher 1926). 

There have also been attempts to construct theoretical 
populations of IID elements; perhaps the most famous was 
the classic “collective” of Von Mises (1931); but they do 
not correspond to actual populations. However, with great 
effort tables of random numbers have been constructed that 
have passed all tests. These have been widely used in 
modern designs of experiments and sample surveys. 
Replication and randomization are two of the most basic 
concepts of modern statistics following the concept of 
populations. 

The simple concept of a population of independent 
elements does not describe adequately the complex distri- 
butions (in space, in time, in classes) of elements. 
Clustering and stratification are common names for 
ubiquitous complexities. Furthermore, it appears impossible 
to form models that would better describe actual populations.
The distributions are much too complex and they
are also different for every survey variable. These 
complexities and differences have been investigated and 
presented now in thousands of computations of “design
effects.”

Survey sampling needed a new paradigm to deal with the 
complexities of all kinds of populations for many survey 
variables and a growing list of survey statistics. This took 
the form of robust designs of selections and variance 
formulas that could use a multitude of sample designs, and 
gave rise to the new discipline of survey sampling. The 
computation of “design effects” demonstrated the existence, 
the magnitude, and the variability of effects due to the 
complexities of distributions not only for means but also for 
multivariate relations, such as regression coefficients. The 
long period of disagreements between survey samplers and 
econometricians testifies to the need for a new paradigm. 


4. COMBINING POPULATION SAMPLES 


Samples of national populations always represent 
subpopulations (domains) which differ in their survey 
characteristics; sometimes they differ slightly, but at other 
times greatly. These subclasses can be distinguished in the 
sample with more or less effort. First, samples of provinces 
are easily separated when their selections are made sepa- 
rately. Second, subclasses by age, sex, occupation, and 
education can also be distinguished, and sometimes used 
for poststratified estimates. Third, however, are those 
subclasses by social, psychological, and attitudinal charac- 
teristics, which may be difficult to distinguish; yet they may 
be most related to the survey variables. Thus, we recognize 
that national samples are not simple aggregations of indi- 
viduals from an IID population, but combinations of sub- 
classes from subpopulations with diverse characteristics. 
The composition of national populations from diverse 
domains deserves attention, and it also serves as an example 
for the two types of combinations that follow. Furthermore, 
these remarks are pertinent to combinations not only of 
national samples but also of cities, institutions,
establishments, etc.

In recent years two kinds of sample designs have 
emerged that demand efforts beyond those of simple 
national samples: a) periodic samples and b) multipopu- 
lation designs. Each of these has emerged only recently, 
because they had to await the emergence of three kinds of 
resources: 1. effective demand supported by financial and
political resources; 2. adequate institutional technical 
resources in national statistical offices; 3. new methods. In 
both types of designs we should distinguish the needs of the 
survey methods (definitions, variables, measurements), 
which must be harmonized, standardized, from sample 
designs, which can be designed freely to fit national (even 
provincial) situations, provided they are probability designs 
(Kish 1994). Both types have been designed first and
chiefly for comparisons: periodic comparisons and multi- 
national comparisons, respectively. But new uses have also 
emerged: “rolling samples” and multinational cumulations, 
respectively. Each type of cumulation has encountered 
considerable opposition, and needs a new outlook, a new 
paradigm. 

“Rolling samples” have been used a few times for local 
situations (Mooney 1956; Kish, Lovejoy and Rackow 
1961). Then they have been proposed several times for 
national annual samples and as a possible replacement for 
decennial censuses (Kish 1981, 1990). They are now being 
introduced for national sample censuses first and foremost 
by the US Census Bureau (Alexander 1999; Kish 1990). 
Recommending this new method, I have usually expe- 
rienced opposition to the concept of averaging periodic 
samples: “How can you average samples when these vary 
between periods?” In my contrary view, the greater the 
variability the less you should rely on a single period, 
whether the variation is monotonic, or cyclical, or 
haphazard. Hence I note two contrasting outlooks, or 
paradigms. Quite often, the opposition disappears after two 
days of discussion and cogitation. 


“For example, annual income is a readily 
accepted aggregation, and not only for steady 
incomes but also for occupations with high 
variations (seasonal or irregular). Averaging 
weekly samples for annual statistics will prove 
more easily acceptable than decennial 
averaging. Nevertheless, many investors in 
mutual stock funds prefer to rely more on their 
ten-year or five-year average earnings (despite 
their obsolescence) than on their up-to-date 
prior year’s earnings (with their risky 
“random” variations). Most people planning a 
picnic would also prefer a 50 year average 
“normal” temperature to last year’s exact 
temperature. There are many similar examples 
of sophisticated averaging over long periods 
by the “naive” public. That public, and policy 
makers, would also learn fast about rolling 
samples, given a chance.” 
(Kish 1998) 


Like rolling samples, combining multipopulation 
samples also encountered opposition: national boundaries 
denote different historical stages of development, different 
laws, languages, cultures, customs, religions, behaviors. 
How then can you combine them? However, we often find 
uses and meanings for continental averages; such as 
European birth and death rates, or South American, or 
sub-Saharan, or West African rates. Sometimes even world 
birth, death, and growth rates. Because they have not been 
discussed, they are all usually combined very poorly. But with
more adequate theory, they can be combined better (Kish 
1999). But first the need must be recognized with a new 
paradigm for multinational combinations, followed by
developing new and more appropriate methods. 


5. EXPECTATION SAMPLING 


Probability sampling assures for each element in the 
population (i = 1,2,...,N) a known positive probability 
(P_i > 0) of selection. The assurance requires some mechanical
procedure of chance selection, rather than only
assumptions, beliefs, or models about probability distri- 
butions. The randomizing procedure requires a practical 
physical operation that is closely (or exactly) congruent 
with the probability model (Kish 1965). Something like this 
statement appears in most textbooks on survey sampling, 
and I still believe it all. However, there are two questionable 
and bothersome objections to this definition and its
requirements. 

The more important of the two objections concerns the 
frequent practical situations when we face a choice between 
probability sampling and expectation sampling. These 
occur often when the easy, practical selection rate for listing 
units of 1/F yields not only the unique probability 1/F for 
elements, but also some with variable k_i/F for the ith
element (i = 1, 2, ..., N) and with k_i > 0. Examples of
k_i > 1, usually a small integer, occur with duplicate or
replicate lists, dual or multiple frames of selection, second
homes for households, mobile populations and nomads,
farm operators with multiple lots. Examples of k_i < 1 are
selecting a single adult from households, selecting single
dwellings from buildings. In these examples often the k_i
can be easily ascertained, and it is cheaper, more convenient 
and economical to use weighting than attempting to obtain 
1/F for all the elements. These problems are described in 
books and articles. 

In most cases, we find it more convenient and less 
expensive to accept the variable probabilities and to counter 
them with weighting the expected values 1/k_i or k_i than to
operate another stage of selection. Thus, to paraphrase
probability sampling: expectation sampling assures for each
element in the population (i = 1, 2, ..., N) a known positive
expected number of selections (k_i/F > 0). These procedures
are used in practice for descriptive (first order) statistics
where the k_i or 1/k_i are neither large nor frequent. The
treatments for inferential (second order or higher)
statistics are more difficult and diverse, and are treated
separately in the literature. Note that probability sampling
is the special (and often desired) situation when all k_i are 1.
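
The weighting argument can be checked with a short calculation. The sketch below is illustrative only: the function names, the four-element population, and the values chosen for y_i, k_i and F are assumptions introduced here, not taken from the text. It shows that when element i has an expected number of selections k_i/F and each selection is weighted by F/k_i, the expected value of the weighted total equals the population total.

```python
from fractions import Fraction


def expected_selections(k, F):
    """Expected number of selections k_i/F for each element."""
    return [Fraction(k_i) / F for k_i in k]


def weighted_total(selected, y, k, F):
    """Estimate the population total from a realized sample.

    `selected` lists element indices, one entry per selection, so an
    element reached through two listings may appear twice.
    Each selection is weighted by F/k_i.
    """
    return sum(Fraction(F) / Fraction(k[i]) * y[i] for i in selected)


def expected_weighted_total(y, k, F):
    """Expectation of the weighted total over the selection procedure."""
    return sum(
        e_i * Fraction(F) / Fraction(k_i) * y_i
        for e_i, k_i, y_i in zip(expected_selections(k, F), k, y)
    )


# Hypothetical population: values y_i and multiplicities k_i.
y = [10, 20, 30, 40]
k = [1, 2, Fraction(1, 3), 1]  # a duplicate listing; one adult chosen from three
F = 50

print(expected_weighted_total(y, k, F))  # equals sum(y)
print(weighted_total([1, 3], y, k, F))   # estimate from one realized sample
```

Setting every k_i to 1 reduces the weight to the constant F, which is the probability sampling case noted above.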

The other objection to the term probability sampling is 
more theoretical and philosophical and concerns the word 
“known” in its definition. That word seems to imply belief. 
Authors from classics like John Venn and M.G. Kendall to 
modern Bayesians like Dennis Lindley — and beyond at 
both ends — have clearly assigned “probability” to states of 
belief and “chance” to frequencies generated by objective 
phenomena and mechanical operations. Thus, our insistence 
on operations, like random number generators, should
imply the term “chance sampling.” However, I have not 
observed its use and it also could lead to a philosophical 
problem: the proper use of good tables of random numbers 
implies beliefs in their “known” probabilities. I have spent 
only a modest amount of time on these problems and 
agreeable discussions with only a few colleagues, who did 
agree. I would be grateful for further discussions, 
suggestions and corrections. 


6. SOME RELATED TOPICS 


We called for recognition of new paradigms in six 
aspects of survey sampling, beginning with statistics itself. 
Finally, we note here the contrast of sampling to other 
related methods. Survey methods include the choice and 
definition of variables, methods of measurements or obser- 
vations, control of quality (Kish 1994; Groves 1989). 

Survey sampling has been viewed as a method that 
competes with censuses (annual or decennial), hence also 
with registers (Kish 1990). In some other context, survey 
sampling competes with or supplements experiments and 
controlled observations, and clinical trials. These contrasts 
also need broader comprehensive views (Kish 1987, section 
A.1). However, those discussions would take us well 
beyond our present limits. 
Still Rolling: Leslie Kish’s “Rolling Samples”
and the American Community Survey 


CHARLES H. ALEXANDER¹


ABSTRACT 


Leslie Kish long advocated a “rolling sample” design, with non-overlapping monthly panels which can be cumulated over 
different lengths of time for domains of different sizes. This enables a single survey to serve multiple purposes. The Census 
Bureau’s new American Community Survey (ACS) uses such a rolling sample design, with annual averages to measure 
change at the state level, and three-year or five-year moving averages to describe progressively smaller domains. This paper 
traces Kish’s influence on the development of the American Community Survey, and discusses some practical 
methodological issues that had to be addressed in implementing the design. 


KEY WORDS: Rolling sample; Multi-year averages; Asymmetrical cumulations. 


1. INTRODUCTION 


A “rolling sample design”, defined below, gives a single 
survey the flexibility to serve multiple purposes. The 
concept was developed by Leslie Kish in a series of papers 
(including Kish 1979a, 1979b, 1981, 1983, 1986, 1990, 
1997, 1998 and Kish and Verma 1983, 1986) in which he 
elaborated the principles of cumulating information over 
space and time from a rolling sample. Kish advocated its 
use for a variety of purposes (Kish 1998), especially in 
developing countries (Kish 1979b), but also in the context 
of the U.S. census (Kish 1981). His personal use of rolling 
samples goes back at least to 1958, under the name 
“continuous sampling” (Kish, Lovejoy and Rackow 1961); 
a still earlier project (Mooney 1956) is cited in Kish (1998). 

The American Community Survey (ACS), which is 
being developed as a replacement for the traditional “long 
form” survey conducted as part of the census, will use a 
form of the rolling sample design. This paper describes how 
the rolling sample concept is being implemented for the 
ACS, influenced by its specific objectives and operational 
considerations. The design decisions made for the ACS 
illustrate some issues that may arise for rolling samples in 
general. They also illustrate how Leslie Kish influenced 
survey development on multiple levels: philosophical, 
personal, and practical. 


2. ROLLING SAMPLES 


A “rolling sample” design jointly selects k non- 
overlapping probability samples (panels), each of which 
constitutes 1/F of the entire population. One panel is 
interviewed each time period until all the sample has been 
interviewed after k periods. Depending on the precision 
requirements, a single panel of 1/F may be sufficient to 
provide good estimates for the population as a whole, and
possibly for some large domains. For smaller domains or
for greater precision for large domains, cumulations of 
different numbers of consecutive panels can be used, up to 
k/F of the population. A rolling sample design with k=F is 
called a “rolling census”. For a monthly rolling sample, it is 
natural to have F be a multiple of twelve, and natural 
cumulations are quarterly, semi-annual, annual, and 
multiple years. 
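
The arithmetic of cumulation can be sketched in a few lines. The function name below is an assumption made for illustration; the values F = 480 and k = 60 are those quoted for the ACS in Section 4 of this paper.

```python
from fractions import Fraction


def cumulated_fraction(panels, F):
    """Fraction of the population covered by cumulating `panels`
    non-overlapping panels, each of which is 1/F of the population."""
    if not 0 < panels <= F:
        raise ValueError("number of panels must be between 1 and F")
    return Fraction(panels, F)


F = 480  # monthly panels, each 1/480 of the population
for label, months in [("quarterly", 3), ("semi-annual", 6),
                      ("annual", 12), ("five-year", 60)]:
    print(label, cumulated_fraction(months, F))

# A rolling census is the case k = F: all panels together cover everyone.
print(cumulated_fraction(F, F) == 1)
```

With these values the annual cumulation covers 1/40 of the population and the five-year cumulation 1/8, which is how a single design can serve domains of different sizes.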

“Domains” include both geographic areas and demo- 
graphic subgroups. Kish (1987, section 2.3) presents a 
framework for the tradeoff between geographic and 
demographic detail, for a given required level of precision. 
Even more central to the idea of rolling samples was the 
idea of “asymmetrical cumulation” of data, over different 
lengths of time for different sizes of domain (Kish 1990, 
1998), which was later broadened into a view of the basic 
similarities of averaging over space and averaging over time 
(Kish 1998), as well as averaging over different demo- 
graphic domains. The flexibility of the rolling sample 
design comes from the opportunities it provides to make 
different tradeoffs between spatial, temporal, and demo- 
graphic detail. 

Leslie Kish left his colleagues with a challenge to extend 
these ideas into a “theory of combining populations” (Kish 
1999, 2001). He organized a contributed paper session on 
“combining surveys” at the 1999 meetings of the Inter- 
national Statistical Institute, explaining to the presenters 
that we were all working on different aspects of the same 
problem, whether we knew it or not. The scope of this 
problem includes various forms of cumulation of data from 
rolling samples, as well as the question of how to combine 
data from different countries into statistics for larger entities 
such as the European Union. Kish (2001) suggests that 
these problems have fundamental features in common with 
the problem of combining information from different 
experiments (Cochran 1937, 1954). 


¹ Charles H. Alexander, U.S. Bureau of the Census, Suitland, Maryland, U.S.A. 20233.




3. THE CENSUS LONG FORM AND 
INTERCENSAL ALTERNATIVES 


The decennial census “long form” survey is the main 
source of subnational data about the characteristics of the 
U.S. population and housing. Estimates of the number of 
people and housing units come from the “short form” part 
of the census administered to all households. With an overall
sampling rate of one-in-six, the long form survey provides
precise, detailed estimates of a variety of demographic and
economic characteristics for individual states, large cities,
and large counties or groups of counties. (“Precise” refers to
the sampling error, and “detailed” means that estimates are
given for many demographic domains within the geographic
domain.) It provides useful, though less precise
and less detailed, estimates for even very small areas such 
as small towns and Indian Reservations, as well as census 
tracts, which average about 4,000 population. For the 
smallest governmental units, higher sampling rates are used, 
as high as one-in-two for the smallest places, so that there 
are usable estimates for these areas. To compensate for the 
higher sampling rates in these areas, the rate is one-in-eight 
in the largest tracts. 

Between the censuses, the federal government’s statis- 
tical programs provide relatively little information about the 
characteristics of the population below the national level. 
The basic census counts are updated by an intercensal 
demographic estimates program, but other demographic and 
economic characteristics are available mainly from national 
surveys. The Current Population Survey (CPS), the U.S. 
monthly labor force survey, has about a one-in-1000 
sampling rate with substantial overlap in the sample units 
from one month to the next so that the sample cannot be 
profitably cumulated over time as a rolling sample can. A 
March Supplement to the CPS collects additional infor- 
mation once a year, providing estimates for income and 
poverty at the state level, but with limited precision and 
demographic detail. There are programs which use 
modeling methods based on administrative records to make 
small-area estimates for unemployment, and for income and 
poverty, but not for a variety of characteristics. 

The need for more frequent information for smaller 
domains (or “communities”) has long been recognized 
(Hauser 1942; Eckler 1972, page 212; Bounpane 1986). 
Leslie gave credit to his friend, Philip Hauser, for proposing 
an “annual sample census” in 1941. Kish (1981) proposed 
a rolling sample as a way to meet this need, presenting 
several options including a rolling sample for the CPS. 
Instead a mid-decade census was authorized for 1985, but 
it was never funded. Nor was a proposal to double the size 
of the CPS (Tupek, Waite and Cahoon 1990). 

Interest at the Census Bureau in intercensal information 
about population characteristics was revived by a proposal 
for a “Decade Census Program” advanced by Herriot, 
Bateman and McCarthy (1989). This program would have 
collected data in different states in different years;
ultimately this proposal did not gain acceptance. However,
Roger Herriot’s energetic and eloquent advocacy of the 
importance and potential value of intercensal subnational 
data created awareness in federal statistical agencies of the 
possibility of a “new paradigm” for the decennial cycle of 
data collection. Awareness of Kish’s rolling sample 
proposal was definitely a factor during this period, as the 
Bureau considered new approaches for the 2000 census (see 
Bounpane 1986). 

There was renewed Congressional interest in intercensal 
characteristics data (Melnick 1991; Sawyer 1993), and a 
“continuous measurement” alternative to the census long 
form was considered as part of the research for Census 
2000, starting in 1992. Kish’s rolling sample design was 
eventually proposed for this purpose because it provided 
flexibility in making estimates, as well as the potential for 
efficient data collection (Alexander 1993, 1997; National 
Academy of Sciences 1994, 1995). My recollection is that 
the most influential articles were Kish (1981, 1990), and 
that Kish and Verma (1983, 1986) were also consulted. 
“Continuous Measurement” was later renamed the 
“American Community Survey (ACS)”. 

The proposed ACS was not adopted for Census 2000, 
but after limited testing during 1996-1998, the ACS metho- 
dology was implemented in 36 counties for the years 1999- 
2001, so that ACS results could be compared to the 2000 
census long form data. There was also a large-scale test in 
2000, for a state-representative annual sample of about 
700,000 addresses called the Census 2000 Supplementary 
Survey, of collecting long-form data separately from the 
census, using the ACS questionnaire. In 2001 and 2002, the 
Supplementary Survey is being continued, as part of the 
transition to the ACS. 


4. THE PLANNED AMERICAN COMMUNITY 
SURVEY 


The ACS will start in 2003, if funded by Congress, with 
a monthly sample of about 250,000 addresses, a new panel 
of addresses starting each month. This corresponds to a 
monthly rolling sample with an average rate of approxi- 
mately F = 480 or an annual sample with F = 40. The 
survey will use k = 60, with the shortest published cumu- 
lation being calendar year estimates. The ACS will be 
conducted by mail, with nonresponse followup by tele- 
phone. A random sample of one-third of the remaining 
nonrespondents will be selected for followup in person. 

For domains with average response rates, with a monthly 
F = 480, the standard errors for a 5-year average estimate 
from the ACS will be somewhat larger than for a corres- 
ponding estimate from the census long form, typically on 
the order of 1.33 times as large. This was judged to be 
“sufficiently close” for most purposes, given the advantage 
of timeliness and the expected lower missing data rates due 
to having a permanent staff of interviewers. In areas with 
lower-than-average mail response rates, the subsampling for
nonresponse follow-up will reduce the effective sample 
size. This happens not only because the number of inter- 
views is reduced, but also because the unequal weights 
typically lead to a higher design effect (Kish 1965, pages 
429-431). To compensate for this, the ACS will probably 
use a higher nonresponse subsampling rate in low-response 
areas, balanced by a lower sampling rate in areas with 
higher-than-average mail response. The details of this are 
still being determined. There also will be an oversample of 
addresses in small governmental units, as with the census 
long form sample. 
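
The effect of unequal weights on precision can be illustrated with the approximation commonly attributed to Kish, deff_w = n Σw² / (Σw)². The sketch below rests on stated assumptions: the function names and the counts of interviews are invented for illustration, the follow-up weight of 3 reflects only the one-in-three subsampling of nonrespondents described above, and the formula is assumed to be an adequate stand-in for the fuller treatment in the pages cited.

```python
def weighting_design_effect(weights):
    """Approximate design effect from unequal weights:
    n * sum(w^2) / (sum(w))^2, which is 1 when all weights are equal."""
    n = len(weights)
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return n * total_sq / (total * total)


def effective_sample_size(weights):
    """Number of equally weighted interviews giving the same precision
    under this approximation: n / deff = (sum(w))^2 / sum(w^2)."""
    return len(weights) / weighting_design_effect(weights)


# Hypothetical area: 600 mail or telephone responses (weight 1) and
# 100 personal-visit interviews from a one-in-three subsample (weight 3).
weights = [1] * 600 + [3] * 100

print(round(weighting_design_effect(weights), 3))
print(round(effective_sample_size(weights)))
```

With these invented figures the 700 interviews carry roughly the information of 540 equally weighted ones. A larger share of subsampled cases, as in a low-response area, would push the design effect higher, which is the loss that a higher follow-up rate in such areas is meant to limit.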

An important development in the last decade that made
the ACS possible is the Census Bureau’s program to
maintain an ongoing Master Address File (MAF), linked to
our TIGER geographic database. (Kish (1981) suggests an
alternative approach of “cumulative rolling listings”, but this
would be quite expensive for making regular estimates for
all of the smallest domains, such as census tracts.) The
main source of address updates throughout the decade is the 
Postal Service’s Delivery Sequence File (DSF). The Bureau 
is implementing a MAF/TIGER modernization program 
that will augment the DSF updates with new addresses from 
data files provided by local governments, and from other 
administrative sources. This will be supplemented by new 
addresses encountered by interviewers from the ACS and 
other surveys in more rural areas. The monthly samples are 
actually generated by selecting an annual sample from the 
MAF in the previous September, and dividing it into 12 
monthly panels. In February, there is a supplemental sample 
of new units from the DSF, spread across the remaining 
months of the year. 

Replacing the 2010 census long form with the ACS is one
component of a program to re-engineer the 2010 census. 
This also includes the modernization of MAF/TIGER, as 
well as a program of early research and testing to automate, 
streamline, and improve the census operations for 2010. 
This combination of improvements is expected to have a 
budgetary cost for the full 10-year cycle that is less than the 
cost of repeating the Census 2000 methods in 2010. This is 
quite a different plan from the vision of ACS described in 
National Academy of Sciences (1994, Chapter 6; 1995, 
Chapter 6), where I expressed hopes that eliminating the 
long form by itself, without other fundamental improve- 
ments, might save enough to pay for the ACS. 


5. SOME VARIATIONS ON THE BASIC DESIGN, 
AND SOME ISSUES 


5.1 Multi-stage Cluster Samples 


The ACS uses an unclustered one-stage systematic 
sample, because the goals include providing data for all 
small geographic domains, such as tracts or block groups, 
each year. From discussions in Kish (1981, 1998), it is clear 
that rolling samples can also use cluster samples and 
multiple stages of selection, as well as varying probabilities 
of selection. However, to qualify as a “rolling sample”, the 
primary sampling units themselves must be a rolling 
sample. A design with a fixed set of primary sampling units 
(PSUs), with a rolling sample within each PSU, is a 
“cumulated representative sample” (Kish 1998). 

Leslie was emphatic that the proposal by Herriot et al. 
(1989) was not what he meant by “rolling sample”. 
However, it would seem to fit the definition as stated in 
Section 2, if the states are considered as PSUs. I think this 
demonstrates that there is an implicit requirement that a 
rolling sample must yield a useful representative probability 
sample in each time period, for each geographic domain of 
interest; this additional requirement does not hold if the 
PSUs are states. This requirement means that the clusters or 
PSUs need to be substantially smaller than the smallest 
domain of interest. (See Kish 1998, page 38.) 


5.2 Differential Sampling Rates 


Kish (1998, section 4) notes that a rolling sample can use 
different sampling fractions in different strata. This can get 
complicated, especially if the sampling fractions change 
over time, because the conditional probability of selecting 
a unit (without replacement) for the j-th panel in the h-th 
stratum depends on the sampling rates used in the previous 
panels in that stratum. This is even more complicated if the 
strata change over time, for example as the boundaries of 
governmental units change. 

To simplify this for the ACS, we select the sample in two 
stages. The first stage selects a rolling “super sample” using 
a constant sampling rate for each panel and each year, equal 
to the largest sampling rate required in any stratum. The 
second stage subsamples the initial sample, to give the 
desired sampling rate for each stratum for that year. The 
selection of subsequent samples, which avoids overlap with 
the entire previous supersamples, needs only to keep track 
of the sampling rate for the first stage. 
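
The two-stage idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the production ACS procedure: it uses independent Bernoulli draws rather than systematic selection, and the stratum names, rates and frame are hypothetical.

```python
import random


def two_stage_sample(frame, stratum_rates, rng):
    """frame: list of (address_id, stratum); stratum_rates: stratum -> desired rate.

    Stage 1 keeps each address with the constant rate max(stratum_rates.values()).
    Stage 2 keeps a supersample address with probability rate_h / max_rate,
    so the overall selection probability in stratum h is rate_h.
    """
    max_rate = max(stratum_rates.values())
    supersample = [unit for unit in frame if rng.random() < max_rate]
    final = [
        (address_id, stratum)
        for address_id, stratum in supersample
        if rng.random() < stratum_rates[stratum] / max_rate
    ]
    return supersample, final


rng = random.Random(2002)
# Hypothetical frame: one address in four lies in a small governmental unit.
frame = [(i, "small_gov" if i % 4 == 0 else "other") for i in range(40000)]
rates = {"small_gov": 0.10, "other": 0.025}  # hypothetical sampling rates
supersample, final = two_stage_sample(frame, rates, rng)
```

Only the supersample has to be remembered when later panels are drawn without overlap, so the bookkeeping involves a single first-stage rate even if the stratum rates in the second stage change from year to year.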


5.3 Updates to the Frame 


In practice, the population is a little different for each 
panel. New addresses are added to the frame. Some old 
addresses cease to exist; they may be removed from the 
address list, or they may stay on the list and be deleted only 
after attempts to contact them. This presents no funda- 
mental conceptual problem. It does mean that a “rolling 
census” would not necessarily contact every population unit 
that ever exists, since some units may go in and out of 
existence too quickly to fall into sample. 

To avoid record-keeping of different conditional 
sampling rates for different “cohorts” of addresses which 
were added during Master Address File updates at different 
times, we have found it convenient to assign artificial “back 
samples” by selecting addresses from each set of new 
addresses not only for the current panel, but for past panels. 
These units are not interviewed, since the times for their 
assigned panels are past, but they are avoided during the 
without-replacement selection of future panels. 


5.4 What Happens After Panel k? 


One question Leslie did not address explicitly, as far as 
I know, is how to draw the sample for panel k+1. I think he 
assumed that panel k + 1 would be the same as panel 1, 
panel k + 2 repeats panel 2, and so forth. This works fine 
for a simple random sample, but not so well for a systematic 
sample intended to spread the sample over a geographically 
sorted list, because as the frame changes over time, panel 1 
doesn’t keep its even spacing. 

Our plan is to select panel k + 1, and future panels, as a 
fresh systematic sample. Each one will avoid overlap with 
the previous k - 1 panels, so there will always be k conse- 
cutive non-overlapping panels, but we won't worry about 
overlapping with panels before that. 
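
A minimal sketch of this plan, assuming a fixed frame and equal selection probabilities, is given below. The frame, the panel size n and the cycle length k are hypothetical, and frame updates and differential rates are deliberately left out.

```python
import random


def systematic_sample(units, n, rng):
    """Equal-probability systematic sample of n units from an ordered list."""
    if n > len(units):
        raise ValueError("sample size exceeds the number of eligible units")
    step = len(units) / n
    start = rng.random() * step
    return [units[min(int(start + i * step), len(units) - 1)] for i in range(n)]


def next_panel(frame, previous_panels, k, n, rng):
    """Fresh systematic panel that avoids the k - 1 most recent panels only."""
    recent = previous_panels[-(k - 1):] if k > 1 else []
    used = set()
    for panel in recent:
        used.update(panel)
    eligible = [unit for unit in frame if unit not in used]  # keeps frame order
    return systematic_sample(eligible, n, rng)


rng = random.Random(1998)
frame = list(range(1000))   # hypothetical geographically sorted address list
k, n = 5, 100               # hypothetical cycle length and panel size
panels = []
for _ in range(12):         # panels 1..k, then k+1, k+2, ...
    panels.append(next_panel(frame, panels, k, n, rng))
```

Because each new panel is drawn from the addresses not used in the previous k - 1 panels, any k consecutive panels are disjoint, while a panel may share addresses with panels that are k or more steps older.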


5.5 Questionnaire Reference Date, Given an 
Extended Interview Period 


The interviews from each monthly ACS panel take place 
over a three-month period, allowing two months for mail 
returns and telephone followup before starting the more 
expensive personal visits in the third month. Thus, the data 
actually collected in June consist of early mail returns from 
the June panel, late mail returns and telephone interviews 
from the May panel, and personal-visit followup cases from 
the April panel. This raises the issue of whether to ask the 
survey questions as of the time the survey was mailed out — 
the best choice as far as sampling bias — or as of the time 
the questions are asked — the best choice as far as response 
error and other nonsampling errors, especially for people 
who have moved from the address. 

Taking these quality tradeoffs into account, we chose to 
use a “current” reference date, collecting the characteristics 
of the household members at the time of interview. One 
reason for this decision is that we think the nonsampling 
errors will be harder to evaluate than the sampling bias. 
Also the sampling biases in the monthly estimates will tend 
to cancel over the course of the year. This is one reason for 
limiting the ACS to annual and multiple-year estimates. 


5.6 Use of Intercensal Population Estimates as 
Survey Weighting Controls 


The Census Bureau has a program of “intercensal” 
demographic estimates, based on demographic models. 
(Leslie would call these “post-censal” estimates, reserving 
“intercensal” for estimates between two censuses that have 
been completed.) These models update the previous census, 
using vital records and other administrative records 
information. These estimates are used as independent 
weighting controls, or “post-stratification factors”, for most 
national household surveys (see Kish 1965, pages 90-92). 
Adjusting the survey weights to agree with controls can 
reduce the variances of survey estimates, adjust for 
differences in coverage by age, sex, race, or Hispanic 
origin, and improve consistency across surveys. The census 
long form similarly uses the census counts as controls in its 
weighting. 
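
As a simple illustration of adjusting survey weights to independent controls, the sketch below applies a one-dimensional post-stratification adjustment. The cells, weights and control totals are invented for the example, and actual survey weighting typically involves several dimensions and further steps not shown here.

```python
from collections import defaultdict


def poststratify(records, controls):
    """records: list of dicts with 'cell' and 'weight'; controls: cell -> total.

    Returns the adjusted weights and the adjustment factors, scaled so that
    the weighted count in each cell equals the control total for that cell.
    """
    weighted_counts = defaultdict(float)
    for rec in records:
        weighted_counts[rec["cell"]] += rec["weight"]
    factors = {cell: controls[cell] / weighted_counts[cell]
               for cell in weighted_counts}
    adjusted = [rec["weight"] * factors[rec["cell"]] for rec in records]
    return adjusted, factors


# Hypothetical sample records and independent population controls.
records = [
    {"cell": "F<40", "weight": 100.0},
    {"cell": "F<40", "weight": 120.0},
    {"cell": "F<40", "weight": 80.0},
    {"cell": "M<40", "weight": 90.0},
    {"cell": "M<40", "weight": 110.0},
]
controls = {"F<40": 330.0, "M<40": 250.0}
adjusted, factors = poststratify(records, controls)
```

A factor above 1 in a cell indicates that the survey, before adjustment, represented fewer people in that cell than the control total, which is how such factors adjust for differences in coverage.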


The weighting controls have traditionally not been 
available for the smallest geographic domains, at least not 
with the demographic detail available for larger areas. 
Plans to produce more detailed controls for use in ACS 
weighting are described in Alexander and Wetrogan (2000). 
Some improvements will come from improved sources of 
administrative data, but in addition the ACS itself will 
provide information on changes in the population, which 
can be incorporated into the demographic models. The 
problem is complicated by the differences between the 
“current resident rule” used in the ACS and the “usual 
resident rule” used in the census; the ACS includes a 
question about part-year residents to help in adjusting for 
this difference. To facilitate this integration of survey data 
and demographic models, and especially to develop error 
measures for the resulting estimates, the Census Bureau is 
trying to develop “statistical” versions of the demographic 
models used in producing the intercensal population 
estimates. The inspiration for this effort to blend the 
statistical and demographic approaches is Purcell and Kish 
(1979). 


6. DIFFERENT CUMULATIONS FOR 
DIFFERENT PURPOSES 


For the main ACS objective, to replace the census long 
form as a source of detailed descriptive statistics, we plan 
to use 5-year ACS cumulations, for a data product similar 
to traditional long form “summary files”. This is the 
shortest time period for which the ACS sampling error is 
judged to be reasonably close to that of the census long 
form. All sizes and types of geographic areas would be 
included on these 5-year data files. For allocating 
government funds based on an assessment of current need 
for the funds, simulations suggest that 3-year cumulations 
may be preferable to the 5-year, sacrificing precision for 
greater recency (Alexander 1998). 

For individual areas, the most prominently published 
data will be one-year averages for areas greater than 65,000 
population, and 3-year averages for areas greater than 
20,000, in addition to the 5-year averages for all areas. 
Annual average estimates for areas below these thresholds 
will be available for more “sophisticated” users to use in 
time series models, and to indicate large variations within 
the multi-year averages, but will not be as prominently 
displayed in our publications or on our websites. 

These planned published ACS data products are 
designed to encourage analysts to use the same length of 
cumulation when comparing areas of different sizes, on the 
grounds that to do otherwise may be perceived as unfair to 
smaller jurisdictions. In doing this, we have accepted the 
notion of “asymmetrical cumulations” as far as levels of 
geography, but not necessarily within the same level of 
geography. For example, we would use one year for 
comparing states, but would recommend 5 years for all the 
counties in a table comparing large and small counties. In 
this latter recommendation, we differ somewhat from Kish 
(1998, pages 42-43) which would let us use tables of 
counties with one-year estimates for large counties, 3-year 
averages for medium-sized ones, and 5-year averages for 
small ones. It will be interesting to see what practices data 
users will adopt in this regard. 


7. WEIGHTING THE YEARS IN MULTI-YEAR 
CUMULATIONS 


Kish (1998) points out that there are a number of choices 
for weighting multi-year cumulations. If there are ten yearly 
means $y_i$, then there are many choices of $\bar{y} = \sum_i w_i y_i$, 
with $\sum_i w_i = 1$, to use as the ten-year cumulations. 

For the ACS 5-year and other multi-year cumulations, 
discussed in section 6, our plans are to give the years equal 
weights in the standard published data products, e.g., 
$w_i = 0.2$ for the 5-year average. This was an area of 
disagreement with Kish (1998), which gently urges us to 
consider alternatives, in particular weights of the form 
$w_{i+1} = C w_i$, with $C > 1$. 
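
To make the two weighting schemes concrete, the sketch below builds equal weights and weights in which each year counts C times as much as the year before, normalised to sum to 1. The value C = 1.5 and the yearly means are made up, and the variance comparison assumes independent yearly estimates with equal variances, which is a simplification.

```python
def geometric_weights(k, c):
    """Weights w_1..w_k with w_{i+1} = c * w_i, normalised to sum to 1."""
    raw = [c ** i for i in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_cumulation(yearly_means, weights):
    """Multi-year cumulation: sum of w_i * y_i."""
    return sum(w * y for w, y in zip(weights, yearly_means))


def variance_inflation(weights):
    """Variance relative to equal weights, assuming independent yearly
    estimates with equal variances: k * sum(w_i^2)."""
    return len(weights) * sum(w * w for w in weights)


equal = geometric_weights(5, 1.0)         # every w_i is 0.2
recent = geometric_weights(5, 1.5)        # hypothetical C = 1.5
yearly = [10.0, 10.5, 11.0, 11.5, 12.0]   # made-up yearly means, oldest first

print(weighted_cumulation(yearly, equal), weighted_cumulation(yearly, recent))
print(variance_inflation(equal), variance_inflation(recent))
```

With a rising series the weighted cumulation sits closer to the most recent year than the equal-weight average does, at the price of a variance inflation factor above 1 under the stated assumptions.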

An underlying issue in thinking about unequal weights 
is what statistical problem we are trying to solve. Using the 
2003 — 2007 cumulation as an example, is the goal: 


— to provide a “direct design based” estimate for the 
2003 — 2007 historical average; 


— to provide a “model-based” estimate for the 2007 
value; or 


— to provide a “direct, design-based” estimate for a 
weighted 2003 — 2007 historical average, with more 
weight on recent years? 


To interpret the 2003 — 2007 estimate as an estimate for 
2007 requires a model or assumptions about the time series 
for the area. The problem may be viewed as combining a 
direct estimate for 2007 with a forecast for 2007 based on 
the years 2003, ..., 2006, with the requirement that the same 
formula be used for all areas and all characteristics to 
preserve additivity in the tables and comparability across 
tables. 

I have previously interpreted the decision as a choice 
between the first two goals, and have shied away from the 
second approach for the ACS, ultimately because of the 
concerns expressed in Hansen, Madow and Tepping (1983, 
sections 3 and 5.5) about using model-based estimates for 
general-purpose “official statistics”. With the variety of 
statistics and geographic areas covered by the ACS, there 
inevitably will be some where the compromise model fails 
badly; a data user may be unaware of this failure, or may be 
very aware. In what sense can the compromise average be 
viewed as a valid estimate for 2007 when the compromise 
model clearly fails, and what measure of error would be 
associated with it? With this view of the issue, we have 
recommended using the unweighted multi-year averages as 
the standard general-purpose data product, with the time 
series of annual estimates being available for use in time 
series models for specific applications, and for interpreting 
the multi-year averages when there is variation within the 
5-year period. 

However, upon rereading Kish (1998), I now interpret 
his view of the weighted average to be the third formu- 
lation, a design-based estimator of a more up-to-date popu- 
lation parameter. This avoids the concerns about model fit 
for general-purpose uses, although there is still the question 
of how to justify and achieve a consensus solution. Also, 
the unequal weights tend to increase the standard errors of 
the multi-year averages. But Kish (1998, page 40) will get 
the last word on the subject: 


“Important questions remain for further 
discussions and research. Perhaps forever, and 
this can become a ‘growth industry.’ ” 


8. NOT COMBINING THE CPS AND THE ACS 


Leslie often said he was pleased to see his idea being 
implemented in the ACS, but I think he was disappointed 
that we did not try to replace both the census long form and 
the CPS with one survey. By contrast with some other 
issues where we had lively discussions, Leslie took a 
“hands off” stance on this issue. I think he viewed this as a 
decision about quality tradeoffs, which the government 
agencies had to work out for ourselves. There were two 
main reasons for our decision: 


We cannot adequately measure the monthly unemployment 
rate with a mail survey. Correct measurement of the 
unemployment rate requires complex questions that would 
not be feasible to ask by mail, for example, to probe to be 
sure that someone who is “looking for work” did conduct 
an active job search. (See Butani, Alexander and Esposito 
1999.) The Census 2000 Supplementary Survey, using the 
ACS procedures, dramatically overestimated the 2000 
national unemployment rate (5.3 percent versus 4.0 percent 
in the CPS). A similar difference was seen in the 1990 
census. 

A mail survey would lag substantially in producing 
monthly rates, compared to the CPS. In addition, the 
impossibility of completing all the mail interviews for a 
panel in the designated month introduces biases in monthly 
estimates (see section 5.5 above). These problems would be 
reduced somewhat for quarterly moving averages instead of 
monthly estimates, which Leslie frequently suggested (for 
example Kish 1999), but the monthly unemployment report 
is an indispensable economic indicator in the U.S. 

It is too expensive to replace the long form without using 
mail. A rolling sample survey, conducted in person with a 
large enough sample to replace the long form, would have 
to be 3 or 4 times as large as the CPS. This is a function of 
the size of the U.S. population, and the number of tract- 
sized domains for which estimates are required from the 
long form. Such a survey would be much more expensive 
per case than the CPS, because it could not use a cluster 
sample or telephone interviews for repeated interviews of 
the same households, as does the CPS. The total cost of 
such a survey would be several times as great as the 
combined cost of the proposed ACS and the CPS. 

Because it is designed so narrowly as a long form 
replacement, the ACS does not illustrate the full range of 
flexibility that Leslie envisioned from a rolling sample. 
Under different circumstances, for a smaller population, 
with less need for very small domains from the “long form 
survey”, or less strict requirements for timing and questions 
for the labor force survey, it might be possible for a labor 
force survey with a rolling sample to meet the demands for 
small domain data. With the further addition of a split panel 
or other components (Kish 1998, pages 40-41) an even 
wider range of objectives could be met. 


9. CONTRIBUTIONS: PHILOSOPHICAL, 
PERSONAL, AND PRACTICAL 


The long list of articles by Leslie Kish on the subject of 
rolling samples clearly demonstrates the intensity and 
tenacity of his campaign for what he understood as an 
important idea. The evolution of the idea over the course of 
these papers also illustrates the depth of his attention to 
“philosophical” questions about the fundamental quality 
objectives for a survey: What are we trying to do? How 
does the choice of survey design relate to what we are 
trying to do, and why? This kind of guidance is crucial at 
the start of a survey program, when the “big questions” are 
being addressed, and makes the difference between ideas 
that quickly fall by the wayside and those that are “still 
rolling”. 

Leslie’s personal support of other statisticians went far 
beyond his papers. Though I was by no means one of his 
closest colleagues, he regularly provided personal advice or 
encouragement when he sensed it was needed. The “still 
rolling” in this paper’s title was the title I used in e-mail 
messages to him when I had news about the ACS’s perilous 
passage through the annual budget cycle, which was most 
of the time. He would respond briefly by e-mail, but 
important messages always came in the form of handwritten 
letters. 

Finally, based on these papers, it is clear that Leslie was 
always a practical person, even at his most philosophical, 
and that his papers cannot be fully appreciated without 
knowing what was going on in the survey world when he 
wrote them. Looking back over his rolling sample papers, 
I can see many comments, about both details and general 
principles, that were aimed at enlightening specific 
decisions that the Census Bureau needed to make at the 
time. I would guess that throughout his work, there are 


specific messages to help out someone somewhere in the 
world who faced a practical design decision at the time. 


Redesign of the French Census of Population 


JEAN-MICHEL DURR and JEAN DUMAIS 


ABSTRACT 


Census-taking by traditional methods is becoming more difficult. The possibility of cross-linking administrative files 
provides an attractive alternative to conducting periodic censuses (Laihonen 2000; Borchsenius 2000). This was proposed 
in a recent article by Nathan (2001). INSEE’s redesign is based on the idea of a “continuous census,” originally suggested 
by Kish (1981, 1990) and Horvitz (1986). A first approach that would be feasible in France can be found in Deville and 
Jacod (1996). This article reviews methodological developments since INSEE started its population census redesign 
program. 


KEY WORDS: Balanced sampling; Census; Continuous census; Calibration. 


1. INTRODUCTION 


1.1 Reasons for the Redesign 


France has been conducting censuses for many years to 
measure the de jure population of its administrative districts 
and to describe the socio-demographic characteristics of its 
territory at all levels of geography, from districts of 
communes to the country as a whole. The 1999 census was 
conducted in the usual manner: delivering and retrieving 
questionnaires by census interviewers, organisation, tech- 
nical assistance and control by INSEE, execution by the 
Mayor as the state representative. For various reasons, how- 
ever, we decided to re-examine the census. 

First, the interval between censuses has a tendency to 
increase in length. Indeed, the periodicity of censuses is not 
covered by laws, and each census date is determined by a 
statutory order. Before the war, censuses were taken every 
five years; then the gap grew to seven years, then eight. The 
last census, originally planned for 1997, was postponed 
until 1999 for budgetary reasons, that is, nine years after the 
previous census. Moreover, the public does not always 
understand the need for such a massive operation at a time 
when the number of administrative files is increasing, even 
though that same public has expressed serious concerns 
about the cross-referencing of such files. In addition, the 
decentralization that has been going on in France for over 
20 years has generated numerous requirements for statistics 
in support of local policy-making. As the supreme source of 
local information, the census had to adapt to these changes 
and provide fresher yet still highly detailed data. 

As a result, a population census redesign program was 
established at INSEE in the late 1990s. Since France has no 
population register and, in view of the circumstances, is 
unlikely to institute one, the decision was made to consider 
a compromise solution that would combine annual sample 
surveys with the use of non-nominative administrative files 
that INSEE is authorized to use solely for statistical 
purposes. Communes whose population is below a certain 
threshold (10,000 for the moment) will be covered by 
annual take-all surveys with a rotation period of five years. 
For the other communes, a sample survey will be conducted 
each year, with the entirety of the commune being covered 
within the same five-year rotation period. To carry out this 
redesign, a new legal framework was needed. The project 
was submitted to the Conseil d’Etat, which recommended 
on July 2, 1998, that the government table draft legislation 
in Parliament. 

Aside from the need to strengthen the legal basis of the census, 
the Conseil was of the view that since population counts 
were referred to in over 200 statutes or regulations, making 
a major change in the way they were produced would 
require legislative approval. Within this framework, the 
purpose of the legislation was essentially to set out the 
principles and rules governing the organization of the 
census. 

The operation was placed under State responsibility and 
control: INSEE was to establish the collection framework 
(concepts, protocols), select the samples, ensure the quality 
of the information collected, and process and disseminate 
the data. The communes, as local organisations, were to 
prepare and conduct the census surveys. The State would 
provide financial assistance to cover the costs. These 
arrangements clarify the roles and responsibilities of each of 
the partners. 


1.2 Quality Goals 


The program has the following quality goals: 


1.2.1 Data Quality 


Timeliness: The goal is to be able to disseminate by the 
end of year A the de jure population of all administrative 
districts as of January 1 of year A-2; a statistical description 
of all geographic units (communes and commune groups, 
districts of major cities, lands, etc.) as of January 1 of year 
A-2; and a statistical description of France and its major 
geographic units (regions, etc.) as of January 1 of year A. In 
comparison with the general census, the redesigned census 
will produce similar population and housing data an 
average of three to four years earlier. 

Relevance: The data produced must be relevant to local 
needs. In particular, data that are worth studying only at 
levels of geography far above the commune will be set 
aside in favour of data that are more useful for local 
purposes. What data will be collected will be determined by 
the Conseil national de l'information statistique (CNIS), 
whose membership includes representatives of various 
categories of producers and users of public statistics. A 
CNIS working group has proposed changes while at the 
same time preserving the necessary continuity with previous 
censuses and limiting the response burden. 

Precision: The census must provide data that are 
meaningful for all levels of geography in France. The data 
produced must be sufficiently precise, even at the sub- 
communal levels, for the most useful cross-tabulations at 
those levels. This means, in particular, distributions by sex 
and age, by type of activity and socio-professional category, 
and by type of housing. It must be possible to estimate the 
precision of the data, and users must be informed of that 
precision. 

User-friendliness: To avoid annoying users, the data 
produced must be easy to understand and comparable in use 
to data produced by a general census. 


1.2.2 Process Quality 


Response burden: To limit the response burden for the 
public, the amount of information collected must be kept to 
a bare minimum. In particular, information available for the 
same level of geography from other sources will not be 
collected in the census unless it can be used to produce 
useful cross-tabulations with other variables. As in previous 
censuses, the personal questionnaire will be confined to one 
double-sided sheet of paper. 

Questionnaire: Since collection is by the drop-off/ 
pick-up method, the questionnaires must be universally 
accessible. To ensure that the questions will be understood, 
qualitative testing was conducted using focus groups. In 
addition, a collection test was carried out on 4,000 
dwellings in the first half of 2001. 

Confidentiality: Data gathered in the census are 
protected by law. Personal information collected in the 
census can be accessed only by authorized persons. The 
data are for INSEE and can be used only for statistical 
purposes. Only data essential to the preparation and conduct 
of census surveys are shared with communes or commune 
groups, on a need-to-know basis. 

Quality of coverage: The coverage of general censuses 
was not systematically evaluated. Following the 1990 
census, a postcensal survey indicated that the rate of under- 
coverage was about 1.8% and the rate of overcoverage was 
about 0.9%, for an overall precision of roughly 0.9%. The 
largest undercounts were in large agglomerations. By 
conducting an annual sample survey in communes with a 
population of more than 10,000 and thereby reducing the 
number of people to be covered in the census, we will be 
able to focus our efforts on obtaining answers from 
respondents. The coverage of the redesigned census will be 
evaluated on a regular basis through comparison with 
administrative data and through special surveys. 


Technical and organizational robustness: Because of 
the volume of data processed and the importance of the 
census, the program must be based on tried and true 
technical innovations. Furthermore, the robustness of the 
census apparatus must be evident in the launch of the opera- 
tion. Technical or functional innovations can be introduced 
at any time in the census cycle as part of evolving mainte- 
nance or specific projects. The annual surveys can be used 
to test the effectiveness of such projects before they are 
applied to the entire process. However, major changes such 
as questionnaire updates will generally be made only for the 
beginning of a five-year cycle. The organization of the 
census will depend on a balanced partnership between 
INSEE and the communes. INSEE must be capable of 
building the proposed structure within its budget and its 
work program by reorganizing its operations. Similarly, the 
communes and intercommunal cooperation bodies must be 
able to support the census organization. The yearly cycle of 
surveying large communes and the option that small and 
medium communes will have of delegating collection to an 
intercommunal body are likely to promote the professional- 
ization of collection workers. 

With the integration of census operations into the annual 
work program of the regional offices, and the fact that the 
operation is one-seventh the size of the general census, 
INSEE will have tighter control of the census. Instead of 
having 110,000 census agents collecting data from 60 
million people in 36,700 communes in a particular year, it 
will have only 18,000 agents visiting roughly 9 million 
residents in about 8,000 communes. 

The division of responsibilities between INSEE and the 
communes, the resources that the communes will require, 
and the validation processes for the various stages will be 
set out in a decree. 

Cost control: With the five-year collection cycle, the 
financial burden of conducting the census can be spread 
over a longer period. For communes with a population of 
more than 10,000, the cost of the redesigned census will be 
lower than the cost of the current census of population. On 
the other hand, for communes with fewer than 10,000 
residents, the cost should be equal to that of a general 
census, but it would be every five years instead of the 
roughly eight-year cycle of the general census. The cost of 
the redesigned continuous census will be equivalent to one 
seventh of that of a general census. This will help to 
achieve the reform without a budget increase. However, a 
slightly larger budget in the first few years would help to 
iron the kinks out of the collection process. 


2. SAMPLING STRATEGY 


The commune is the linchpin of the redesign effort. The 
set of “small and medium-sized communes” (those with a 
population of less than 10,000) will be sampled at an 
average rate of 20% a year, and all their dwellings will be 
visited; all “large communes” will be visited annually, but 
only a fraction of their dwellings will be surveyed. 


2.1 Small and Medium-sized Communes 


Let’s start with “small and medium-sized communes”. In 
each region, five rotation groups of communes will be 
formed using data from the 1999 population census. They 
will consist of balanced samples (Deville and Tillé 1999, 
2000) of the age-sex distribution of the communes’ popu- 
lation. This approach should help minimize year-to-year 
variation due to sampling. 

Figures 1 and 2 show how balanced the five rotation 
groups will be. They contain box-and-whisker diagrams of 
two variables measured in the 2,811 small and medium 
communes in Rhône-Alpes in the 1990 population census. 
For each rotation group, both the quartiles and the range of 
the distribution are shown. It is interesting to note how 
similar the charts are. The “number of women aged 20 to 
39” variable was used to form the groups. Neither the 
number of principal residences nor any of the household or 
dwelling variables plays a part in the balancing. 

Each year, the population and housing in all the 
communes in one rotation group will be fully enumerated. 
Hence, each “small and medium commune” will be 
completely enumerated once every five years, and a fifth of 
all the “small communes” will be covered each year. 


2.2 Large Communes 


The “large commune” sample will be based on the 
“répertoire d’immeubles localisés” (RIL) (inventory of 
located buildings). The RIL is a list of buildings 
(residential, institutional or commercial) identified indivi- 
dually so as to generate a digitized map. Initially, the RIL 
will be populated with data from the 1999 census, which 
will provide a statistical portrait of each residential 
building. (In the 1999 census, a building is defined as the 
set of dwellings served by the same staircase; thus, a single 
physical building can consist of more than one “census 
building”.) 

The RIL will be continually updated using building 
permits, demolition permits, utility records (water, gas, 
hydro, etc.), information supplied by local governments, 
and field observations. Thus, the RIL may be used to create 
a building sample frame for “large communes”. 

In each IRIS2000 of each “large commune”, five rotation 
groups of addresses will be formed using the same sampling 
model as in “small and medium communes”. (An IRIS2000 
is a set of “îlots regroupés selon des indicateurs 
statistiques” (blocks grouped by statistical indicators), a 
homogeneous area with a population of about 2,000.) 
Three additional strata will be created in each IRIS2000: 
one for industrial buildings (plants, warehouses, etc.), 
another for collective dwellings (institutions, group homes, 
communal groups, boarding schools, etc.) and a third for 
new addresses. 

One fifth of the industrial buildings will be visited each 
year to verify that they contain no dwellings (custodian’s 
quarters or space converted for habitation); any dwellings 
found in such buildings will be considered self-representing 
because of their special nature. All collective dwellings will 
be covered each year; 20% of them will be visited, and the 
population counts of the remaining 80% may be updated by 
telephone. Finally, all new residential buildings will be 
inserted in the rotation groups. 


As noted above, each address rotation group will be 
visited once in each five-year period. A sub-sample of 
addresses, which corresponds to 40% of the dwellings of 
the group, will be selected. In each selected address, the 
complete dwelling content will be surveyed. 

In summary, the annual sample will consist of some 8 
million individual forms, 6 million from “small and medium 
communes” and 2 million from “large communes”. 


3. OVERALL AND DETAILED ESTIMATES 


In the continuous census system, three sets of estimates 
will be produced and published each year: a set of de jure 
population estimates, a set of detailed estimates (from 
which the de jure population estimates will be derived) and 
a set of overall estimates that will be used to calibrate the 
detailed and de jure population estimates. 


3.1 Overall Estimates 


According to current dissemination plans, the national 
and regional results of the survey conducted at the 
beginning of year A will be published on December 31 of 
year A. These estimates will be the overall estimates for 
year A. In addition, the results for each “small and medium 
commune” visited during the year A collection campaign 
will be published on the same date. 


3.2. Detailed Estimates 


Administrative files will supply additional information 
at a sufficient level of detail. It will then be possible to 
measure the systematic error between what has been ob- 
served and what is in the files for similar objects (buildings, 
blocks, etc.). This systematic error in carefully chosen 
aggregates can be used to produce an adjustment factor 
which will then be applied to the administrative data to 
ensure that their adjusted totals match the census estimates. 

Current plans are to use administrative files at a level of 
geographic aggregation (building, block, census agent 
district, etc.) that will provide information about individuals 
(age and sex according to health insurance files) or their 
dwellings (property tax files). 

Detailed results for year A-2 will be released on 
December 31 of year A. (Acquisition and processing of 
administrative files are expected to take about two years.) 
These detailed results will be a blend of survey data (large 
communes) or census data (small and medium communes) 
with synthetic data. 

The synthetic data will be obtained from the relationship 
between observed data and administrative data for the same 
point in time and space. For example, for commune C of 
Group II enumerated in year A-3 (census count denoted 
$R^{A-3}_{C,II}$), the imputed census count for target year A-2 
will be given by 

$$\hat{R}^{A-2}_{C,II} = R^{A-3}_{C,II} \times \frac{\sum_{c \in II} \mathrm{Adm}_c^{A-2}}{\sum_{c \in II} \mathrm{Adm}_c^{A-3}}$$

where $\mathrm{Adm}_c^{a}$ is the value derived from administrative 
sources for commune c and year a. 
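
As a hedged illustration of this extrapolation, the sketch below scales a commune's census count by the ratio of administrative totals for its rotation group in the target year and in the census year. The function name and the figures are hypothetical, and summing the administrative values over the whole group is an assumption made for the example.

```python
from typing import Sequence


def extrapolate_count(census_count: float,
                      adm_target_year: Sequence[float],
                      adm_census_year: Sequence[float]) -> float:
    """Carry a census count forward (or backward) with an administrative ratio.

    census_count     : count observed for one commune in its census year
    adm_target_year  : administrative values, target year, communes of the group
    adm_census_year  : administrative values, census year, same communes
    """
    denominator = sum(adm_census_year)
    if denominator == 0:
        raise ValueError("administrative total for the census year is zero")
    return census_count * sum(adm_target_year) / denominator


# Hypothetical figures: the administrative total grows by 2% between the years.
print(extrapolate_count(1000.0, [510, 1020, 2040], [500, 1000, 2000]))
```

The same function covers backward projection: it is enough to pass a census year that is later than the target year.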

In the continuous census, for a “small and medium 
commune” surveyed in years A-5 and A (see the table 
below), person variables (age, sex, labour force activity, 
occupation, etc.) and dwelling variables (household size, 
number of rooms, tenure, conveniences, etc.) will be 
measured at two points in time. 

In addition, for Groups IV and V, the synthetic estimates 
for year A-2 could benefit from the information collected in 
the campaigns for years A-1 and A respectively. Adjustment 
factors could be computed in relation to the most 
recent census and used to produce backward projections for 
the intercensal period. For example, for commune D in 
Group IV, we can compute the following: 

$$\Theta_1 = R^{A-6}_{D,IV} \times \frac{\sum_{c \in IV} \mathrm{Adm}_c^{A-2}}{\sum_{c \in IV} \mathrm{Adm}_c^{A-6}} \quad \text{and} \quad \Theta_2 = R^{A-1}_{D,IV} \times \frac{\sum_{c \in IV} \mathrm{Adm}_c^{A-2}}{\sum_{c \in IV} \mathrm{Adm}_c^{A-1}}$$

         A-6        A-5        A-4        A-3        A-2        A-1        A
Gr I     Adm        Adm        R, Adm     Adm        ?, Adm     Adm        Adm
Gr II    Adm        Adm        Adm        R, Adm     R̂, Adm     Adm        Adm
Gr III   Adm        Adm        Adm        Adm        R, Adm     Adm        Adm
Gr IV    R, Adm     Adm        Adm        Adm        ?, Adm     R, Adm     Adm
Gr V     Adm        R, Adm     Adm        Adm        ?, Adm     Adm        R
Total    ΣR, ΣAdm   ΣR, ΣAdm   ΣR, ΣAdm   ΣR, ΣAdm   ΣR, ΣAdm   ΣR, ΣAdm   ΣR, ΣAdm


It is virtually certain that these two series, extrapolations 
and backward projections, will not match. Nevertheless, it 
is best to publish just one set of estimates for any area and 
any point in time. It makes sense to produce a “composite” 
series whose end points are tied to census data. The 
following linear combination may accomplish just that 
while giving more weight to the more recent survey data: 

$$\hat{R}^{A-2}_{D,IV} = 0.2 \times \Theta_1 + 0.8 \times \Theta_2.$$

Similarly, for commune E in Group V, with $\Theta_1$ and $\Theta_2$ 
appropriately defined, we would have: 

$$\hat{R}^{A-2}_{E,V} = 0.4 \times \Theta_1 + 0.6 \times \Theta_2.$$
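
The weights 0.2/0.8 and 0.4/0.6 are what linear interpolation in time would give between the two census years that surround the target year. The text does not state that rule, so the sketch below treats it as an assumption; the function names and the figures are hypothetical.

```python
def interpolation_weights(year_before: int, year_after: int,
                          target_year: int) -> tuple[float, float]:
    """Weights for the forward extrapolation and the backward projection.

    Assumption: each weight is proportional to the closeness of the
    corresponding census year to the target year (linear interpolation).
    """
    if not year_before < target_year < year_after:
        raise ValueError("target year must lie strictly between the census years")
    span = year_after - year_before
    w_before = (year_after - target_year) / span
    w_after = (target_year - year_before) / span
    return w_before, w_after


def composite_estimate(theta_1: float, theta_2: float, year_before: int,
                       year_after: int, target_year: int) -> float:
    """Combine the extrapolation (theta_1) and the backward projection (theta_2)."""
    w_before, w_after = interpolation_weights(year_before, year_after, target_year)
    return w_before * theta_1 + w_after * theta_2


# Years are expressed relative to A, so A-6 is -6 and A is 0.
print(interpolation_weights(-6, -1, -2))  # Group IV: censuses in A-6 and A-1
print(interpolation_weights(-5, 0, -2))   # Group V: censuses in A-5 and A
print(composite_estimate(1000.0, 1050.0, -6, -1, -2))
```

With this rule the composite series coincides with the census figure at each census year, which is the end-point property the text asks for.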


Adjustment factors $\Theta$ will have to be calculated for 
relatively detailed population strata, such as age-sex classes, 
so as to keep as much demographic and geographic flexibi- 
lity as possible in the census adjustment. The quality of the 
administrative files and local disparities will dictate the 
level at which the adjustment can be made most conve- 
niently (for départements, metropolitan areas, ...). The same 
process can be applied to large communes if we replace 
“small commune” with “address”. 

Finally, when every commune in every group has been 
imputed, the estimated total for a variable of interest from 
the imputed file (detailed estimates) is unlikely to match the 
total estimated from observations alone (overall estimates 
published two years earlier). It has therefore been decided 
that the detailed estimates will be calibrated on the overall 
estimates. Once again, the level of calibration will depend 
on local trends and the quality of the overall estimates. 


3.3. De Jure Population Estimates 


The de jure population estimates are the third set of 
estimates derived from the census. They are the population 
figures that are used, by law, to determine commune 
funding, electoral boundaries, the composition of municipal 
councils, etc. 

The “total de jure population” of a commune includes 
persons 
— whose principal residence is within the commune, 

— who live in an institution or a collective dwelling 
located within the commune, 

— who have a residence in the commune and live in an 
institution or a collective dwelling located in another 
commune but have kept a dwelling in their commune 
of origin, 

— who live in a collective dwelling in another commune 
for work or live in another commune for education, 

— who are attached to the commune for administrative 
purposes (itinerant workers, sailors and so on). 


Clearly, these populations cannot be estimated until the 
entire territory of the commune has been covered, that is, 
until the detailed estimates have been produced. 


3.4. Estimation of Sampling Variance 


The overall and detailed estimates will be accompanied 
by a measure of their statistical quality. Work on this 
project began in the fall of 2001. The preferred option at 
this time is to use reference tables, as is done in the 
Canadian Labour Force Survey, for example. The sampling 
variances will probably be obtained by resampling the 
frame. 


3.5. Imprecision Due to Synthesis 


In section 3.2, we showed how collected data will be 
used to produce synthetic estimates: first, an extrapolation 
for an “old” census, for two rotation groups (I and II, say); 
then directly using the census results for a third rotation 
group (III, say); and finally, combining extrapolations and 
backward projections to calibrate the last two groups (IV 
and V, say). 

This synthesis can be formalized using a non-response 
model (Särndal 1990). The annual campaign is similar to a 
take-all survey that has an 80% non-response rate, which is 
dealt with using ratio imputation. If we let s represent the 
whole sample, r the respondents and s-r the 
non-respondents, we have 

$$y_{\bullet k} = \begin{cases} y_k & k \in r \\ \hat{B} x_k & k \in s - r \end{cases} \quad \text{with} \quad \hat{B} = \frac{\sum_{r} y_k}{\sum_{r} x_k}$$
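
A minimal sketch of ratio imputation follows, assuming that a missing census value is flagged with `None` and that the administrative value x is known for every unit. The function name and the figures are illustrative only.

```python
from typing import Optional, Sequence


def ratio_impute(x: Sequence[float],
                 y: Sequence[Optional[float]]) -> tuple[float, list[float]]:
    """Fill missing y values with b_hat * x, where b_hat = sum_r(y) / sum_r(x).

    x : auxiliary (administrative) value, known for every unit in the sample
    y : observed value for respondents, None for non-respondents
    """
    respondents = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    sum_x = sum(xi for xi, _ in respondents)
    if not respondents or sum_x == 0:
        raise ValueError("need respondents with a non-zero auxiliary total")
    b_hat = sum(yi for _, yi in respondents) / sum_x
    completed = [yi if yi is not None else b_hat * xi for xi, yi in zip(x, y)]
    return b_hat, completed


# Hypothetical sample: one unit in five observed, as in an annual campaign.
x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [11.0, None, None, None, None]
b_hat, completed = ratio_impute(x, y)
print(b_hat, completed)
```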


Thus, the imputation model is 

$$y_k = \beta x_k + \varepsilon_k, \qquad E(\varepsilon_k) = 0, \qquad V(\varepsilon_k) = \sigma^2 x_k$$

where the errors $\varepsilon_k$ are not correlated. With such a model, 
under simple random sampling, 

The uncertainty around estimation with imputation depends 
on the sampling errors and the quality of the imputation model, 
assuming that the design and response mechanism are 
independent from imputation. Using imputed data as if they 
were observed data to compute the estimate of V_. results in 
an underestimate of V. In terms of expectation, 


sample* 
E eA V, - V ) =V 5 if 
For the estimators of these variances, Särndal shows that 
we get 


=N2 - =) {s? +c, 6} 
with A=x,,/x,, which we can take as a respondent 
selection effect. We note that if $x_k = 1$, then we obtain a 
two-phase scope of size m in n and n in N. In addition, if 
=r,V, 
total — 

In Särndal’s model, the x (administrative data) and y 
(census data) are contemporaneous; at the very least, we 
will have observed some of the y. Using the structure 
developed in the previous section, we would have: 


In the continuous census system, not everything is 
synchronous: 


That is, $Y^{A-3}$, $X^{A-3}$, $Y^{A-2}$ and $X^{A-2}$ are not all measured 
or observed in the same year. In fact, if we look at Group III 
on its own, for example, we have a sample of size n in year 
A-2 and an identical but totally non-respondent sample in 
year A-3. Consequently, some parameters in the estimate of 
$V_{\text{total}}$ cannot be calculated. 

On the other hand, if we take the problem over a specific 
period, we have a sample of size n and 4n non-respondents. 
We could approximate the uncertainty of the asynchronous 
imputation process (the process we have in the redesigned 
census) with the uncertainty of the synchronous imputation 
process (similar to Särndal’s model). 

This approach was tested on the small and medium 
communes of Rhône-Alpes, for which the rotation groups, 
1990 property tax data and 1990 population census data are 
available (Kauffmann 2000). The method gives good results 
for variables that are highly correlated with property tax; the 
results also indicate that a source of administrative data that 
are similar to variables describing people will be necessary 
to maintain the model errors at an acceptable level. 


4. WORK IN PROGRESS 


The methodological work involved in redesigning the 
census is far from complete. The following projects are still 
under way: 


— establishment of rules for crossing the size threshold, 
problems of oscillation around the 10,000 population 
threshold, and calculation of the de jure population; 


— the sensitivity of stratum boundaries in large 
communes and their robustness over time; 


— the updating and maintenance of sampling frames and 
samples, especially adjustments that may be required 
when a commune crosses the size threshold and the 
incorporation of new objects into rotation groups; 


— massive imputation and synthesis, both models and 
their precision; 


— estimation of the precision of estimators; and 


— collecting data from mobile population groups. 


Benchmarking Parameter Estimates in Logit Models of Binary Choice and Semiparametric Survival Models 


IAN CAHILL and EDWARD J. CHEN 


ABSTRACT 


An approach to exploiting the data from multiple surveys and epochs by benchmarking the parameter estimates of logit 
models of binary choice and semiparametric survival models is developed. The goal is to exploit the relatively rich source 
of socio-economic covariates offered by Statistics Canada’s Survey of Labour and Income Dynamics (SLID), and also the 
historical time-span of the Labour Force Survey (LFS), enhanced by following individuals through each interview in their 
six-month rotation. A demonstration of how the method can be applied is given, using the maternity leave module of the 
LifePaths dynamic microsimulation project at Statistics Canada. The choice of maternity leave over job separation is 
specified as a binary logit model, while the duration of leave is specified as a semiparametric proportional hazards survival 
model with covariates together with a baseline hazard permitted to change each month. Both models are initially estimated 
by maximum likelihood from pooled SLID data on maternity leaves beginning in the period 1993-1996, then benchmarked 
to annual estimates from the LFS 1976-1992. In the case of the logit model, the linear predictor is adjusted by a log-odds 
estimate from the LFS. For the survival model, a Kaplan-Meier estimator of the hazard function from the LFS is used to 
adjust the predicted hazard in the semiparametric model. 


KEY WORDS: Microsimulation; Benchmarking; Semiparametric survival models; Binary logit. 


1. INTRODUCTION 


Researchers often base econometric models on a survey 
conducted over a short period of time. In this case it may be 
desirable to incorporate information from a supplementary 
data source covering a longer period, even if measurements 
are only available for the dependent variable. For a broad 
class of non-linear models, we develop a simple method of 
benchmarking the parameter estimates obtained from a 
survey rich in explanatory variables to information from a 
survey with significant historical depth. A primary objective 
is that model predictions accord with information from the 
secondary data source. We demonstrate application of the 
method first to a simple logit model of binary choice, and 
secondly to a semiparametric survival model. Since the 
survival model can be viewed as a sequence of binary 
choices, while retaining an interpretation as an incompletely 
observed continuous time model, it provides a natural 
generalization of the first application. 

The illustration we provide is a study of maternity leave. 
The Statistics Canada Survey of Labour and Income 
Dynamics (SLID) provides data on both the incidence of 
choosing a maternity leave over withdrawing from the 
labour force, and on the duration of maternity leave, as well 
as a rich set of explanatory variables. Because of this we 
use SLID to estimate base parameters, including those 
determining the effects of the explanatory variables on the 
incidence (the logit model) and hazard of returning to work 
(the survival model). The Canadian Labour Force Survey 
(LFS) conducted by Statistics Canada provides reasonable 
proxies for both the incidence and duration extending back 
to 1976. The SLID parameter estimates are therefore 
benchmarked to LFS estimates of incidence and the hazard 
of returning to work during the period 1976-1992, which is 
prior to the availability of SLID data. 

The work was carried out while developing the maternity 
leave module of the LifePaths microsimulation model at 
Statistics Canada. The goal of the LifePaths project is to 
construct a dynamic microsimulation model encapsulating 
as much detail as possible on socio-economic processes in 
Canada, as well as the historical patterns of change in those 
processes. LifePaths has been employed in a broad range of 
policy analysis and research activities. Examples include 
Canada Student Loan policy (under contract to Human 
Resources Development Canada and the Government of 
Ontario), returns to education (Appleby, Boothby, Rouleau 
and Rowe 1999), time use (Wolfson and Rowe 1996; 
Wolfson 1997; Wolfson and Rowe 1998a), tax-transfer and 
pensions (Wolfson, Rowe, Gribble and Lin 1998; Wolfson 
and Rowe 1998b), and labour force careers (Rowe and Lin 
1999). In addition, the task of assembling data for LifePaths 
has required new research into, for example, educational 
careers (Chen and Oderkirk 1997; Rowe and Chen 1998; 
Plager and Chen 1999) and earnings correlation (Chen and 
Rowe 1999). 

LifePaths is intended to incorporate socio-economic 
information from all relevant sources available to Statistics 
Canada. Consequently the construction of the model has 
motivated research into application of methodologies for 
exploiting multiple data sources. Embedding an estimated 
model in LifePaths is a powerful tool for deriving impli- 
cations of the model that can be compared to information 
from other sources. For example, Rowe and Lin (1999) 
derived job tenures by simulation from a model estimated 
using short-period longitudinal data, then compared the 
results with data from a cross-sectional survey. We report 
on one aspect of the continuing effort to build a tool 
providing the maximum information that can be extracted 
from Statistics Canada’s data sources. 

The paper is organized to illustrate the way in which 
technical problems are often encountered in the course of 
building LifePaths, and how their solution is integrated with 
the model development process. To do this, a fair amount 
of background detail on associated issues is provided. 
Section 2 outlines the context of the benchmarking 
problem, and section 3 presents the theory behind our 
solution, with some possible extensions for further work. 
Section 4 describes the models to which it will be applied, 
including some details concerning the estimation of their 
parameters in the base period, then section 5 describes the 
application of the benchmarking method to these models. 
We display and discuss our empirical results in section 6, 
then close with some overall conclusions in section 7. 


2. CONTEXT OF THE PROBLEM 


We provide context in this section by presenting an 
overview of the LifePaths model structure, a brief descrip- 
tion of data sources involved, and a discussion of how the 
benchmarking problem arose. 


2.1 Structure of the LifePaths Model 


The LifePaths model simulates individual lifetimes as a 
series of events which modify the set of “state variables” 
describing the demographic, social, and economic circum- 
stances of the individual. Waiting times to every possible 
event are associated with an individual, although they may 
be infinite. The waiting times may be conditioned on the 
values of state variables. The event type with the shortest 
waiting time occurs (its associated functions are called). 
Modification of any state variable at the occurrence of an 
event may lead to the generation of new waiting times for 
other events. 
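
The event mechanism described above can be illustrated with a small competing-waiting-times loop. This is only a sketch under stated assumptions, not the LifePaths implementation: waiting times are taken to be exponential with state-dependent rates, a rate of zero stands for an infinite waiting time, and all waiting times are redrawn after every event (harmless for exponential times, which are memoryless). The events and rates are invented for the example.

```python
import math
import random
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
# Each event pairs a rate function (state -> rate) with an effect (state -> None).
Event = Tuple[Callable[[State], float], Callable[[State], None]]


def draw_waiting_time(rate: float, rng: random.Random) -> float:
    """Exponential waiting time; a zero rate means the event never occurs."""
    return rng.expovariate(rate) if rate > 0 else math.inf


def simulate(events: Dict[str, Event], state: State, horizon: float,
             rng: random.Random) -> List[Tuple[float, str]]:
    """Repeatedly fire the event with the shortest waiting time until `horizon`."""
    clock = 0.0
    history: List[Tuple[float, str]] = []
    while True:
        # Waiting times are conditioned on the current state variables.
        waits = {name: draw_waiting_time(rate_fn(state), rng)
                 for name, (rate_fn, _) in events.items()}
        name, wait = min(waits.items(), key=lambda item: item[1])
        if clock + wait > horizon:
            break
        clock += wait
        events[name][1](state)  # the event modifies the state variables
        history.append((clock, name))
    return history


# Hypothetical events and rates, for illustration only.
EVENTS: Dict[str, Event] = {
    "marry": (lambda s: 0.0 if s["married"] else 0.1,
              lambda s: s.update(married=True)),
    "birth": (lambda s: 0.2 if s["married"] else 0.0,
              lambda s: s.update(children=s["children"] + 1)),
}

state = {"married": False, "children": 0}
print(simulate(EVENTS, state, horizon=50.0, rng=random.Random(1)))
```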

LifePaths initialises a case by randomly generating a 
“dominant” individual’s sex, province of residence, age at 
immigration and year of birth. The year of birth can range 
from 1892 to 2051. Mortality and immigration assumptions 
are designed to reproduce provincial age-sex structures. 
When a dominant individual marries, enters a common-law 
union, or has a child, a non-dominant individual of suitable 
characteristics is created and is linked to the dominant 
individual, forming part of the case. Once created, non- 
dominant individuals undergo the same possible events as 
dominant individuals. However, since their purpose is to 
complete the profile of the dominant actor, they are usually 
filtered from all tabular reports. 


LifePaths presently includes models of fertility, 
mortality, marriage (including common-law unions), educa- 
tional careers, labour force careers, maternity leave, hours 
of work, earnings, taxes, and transfers. The model of the 
labour force careers describes transitions between the states 
“paid employee,” “self-employed,” and “not employed.” It 
also includes a model of retirement and student work. The 
model of secondary and post-secondary educational careers 
at the provincial level is mature and highly developed. 


2.2 The Data Sources 


The estimation of base parameters for the model of 
maternity leave was carried out using data from SLID 
covering maternity leaves beginning in the period 
1993-1996. Using data from 1997 allowed us to follow 
most maternity leaves to completion rather than using 
extensively censored data. SLID is a household survey 
designed to permit both longitudinal and cross-sectional 
analysis of people’s financial and work situations. Starting 
in 1993, SLID follows the same respondents for six years, 
with new rotation groups introduced every three years. Each 
rotation group includes about 15,000 households with 
30,000 adults. From this survey we obtain the month of 
child birth, monthly data on labour force status, and a rich 
set of explanatory variables including job tenure, an 
indicator of self-employment, birth order of the child, 
presence of an employed spouse, province of residence, 
education level, and age. We can also determine if a mother 
who left a job within 4 months of birth has returned to the 
same job within 16 months. This is used as a practical 
definition of maternity leave and becomes our unit of 
analysis, with a slight expansion to include the 1% of cases 
where a mother returned to a different job from a labour 
market state of absence in the previous month. Using this 
unit of analysis we get a sample size of 835 births. As we 
show in section 6, this sample size is adequate to reveal 
some key explanatory factors. More precisely, several 
factors are found to be significant at the 95% confidence 
level. This sample contains about 730 unique mothers, 
representing over 87% of the sample of births. This means 
that there will be some correlation between observations as 
a result of those mothers who have two or more maternity 
leaves within the observation period, but we did not feel 
that it was of sufficient magnitude to warrant any special 
statistical tools. 

The LFS is a monthly household survey focussing on 
labour force status, and also reporting a number of 
demographic characteristics. The survey is normally used 
exclusively for cross-sectional analysis. For the LifePaths 
project, however, a file covering the period from 1976 to 
1995 was constructed that follows individuals as they rotate 
through the six monthly rotation groups of the survey, 
providing a six-month window on each individual’s labour 
market activity. Since the number and ages of children are 
recorded each month, it is possible to observe the 
appearance of a new child. Since all surveys throughout the 
period are used, the sample size is very large, and about 
26,000 births are observed. 

In the LFS window we note the labour force status of a 
new mother when the child is first reported. This is the key 
to estimating the probability of choosing a maternity leave, 
rather than leaving the labour force. We begin by consi- 
dering P(E), the proportion of such mothers who are 
employed. If the mother is “employed, at work,” we 
suppose that she took a brief absence from her job of less 
than a month. If she is “employed, absent from work,” it 
may be that she has chosen to take a maternity leave 
absence from her job and then return to it. However this 
may not always be the case. A new mother who we observe 
as employed and absent (EA) may later make a transition 
out of employment (to NE). To correct for this, considering 
mothers with a child of age less than a year observed in a 
window, we calculate the proportion P(EA→NE) of 
transitions out of the “employed, absent from work” state 
that are to a not-employed state. We also estimate the 
proportion P(NE→OJ) of mothers who return to an old job 
(OJ) after having left employment. The estimate is obtained 
by using observations on mothers with a young child who 
make transitions from a not-employed state to a job with a 
start date earlier than the previous month. Our estimate of 
the probability of choosing a maternity leave is now 
P(E) − P(EA→NE) + P(NE→OJ). 
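
The calculation just described can be sketched as follows. The helper names and all figures are hypothetical; the only element taken from the text is the combination P(E) − P(EA→NE) + P(NE→OJ) of weighted proportions.

```python
from typing import Sequence


def weighted_proportion(flags: Sequence[bool], weights: Sequence[float]) -> float:
    """Survey-weighted share of observations for which the flag is true."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w for f, w in zip(flags, weights) if f) / total


def maternity_leave_probability(p_employed: float, p_ea_to_ne: float,
                                p_ne_to_oj: float) -> float:
    """Combine the three proportions: P(E) - P(EA->NE) + P(NE->OJ)."""
    estimate = p_employed - p_ea_to_ne + p_ne_to_oj
    if not 0.0 <= estimate <= 1.0:
        raise ValueError("the combined estimate falls outside [0, 1]")
    return estimate


# Hypothetical proportions for one year, for illustration only.
p_e = weighted_proportion([True, False, True, True], [1.5, 1.0, 0.5, 1.0])
print(maternity_leave_probability(p_e, p_ea_to_ne=0.05, p_ne_to_oj=0.02))
```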

It is also possible to observe mothers with a child of age 
less than a year making a transition from the status 
“employed, absent from work for personal or family 
responsibilities” to the status “employed, at work.” We use 
this transition as a proxy for the return to work after a 
maternity leave. Since the duration of absence is reported in 
the previous month, this is the key to benchmarking the 
survival model. 

The preceding discussion illustrates the weakness of the 
LFS data for a study of maternity leave, relative to SLID 
data. In addition to having fewer explanatory variables 
available than in SLID, we must accept proxies for the 
dependent variables. Nevertheless, we require the historical 
depth of the LFS. This relationship between the data sets is 
the context of the benchmarking problem described in the 
next section. 

Both the SLID and the LFS have complex sample 
designs involving detailed stratification, and complex 
methods for calculating observation weights. We always 
make use of observation weights, both in estimation and in 
the calculation of frequencies. The methods used are fairly 
simple, and are discussed in sections 4 and 5. 


2.3. The Benchmarking Problem 


The context of our benchmarking problem is a model of 
women choosing between leaving the labour force or taking 
a maternity leave, and if they choose a leave, deciding how 
long that leave should be. The first decision is represented 
by a binary logit model, and the second by a semiparametric 
survival model, both including a vector of explanatory 
variables and associated parameters. In LifePaths, the 
decisions are made as part of the maternity leave choices 
event, which always occurs in the middle of a pregnancy. 
SLID is quite adequate for estimation of the base para- 
meters of both these models. However, since a major goal 
of the LifePaths project is to incorporate historical patterns 
of change in socio-economic processes, it was necessary to 
benchmark the SLID parameter estimates to annual esti- 
mates of dependent variable means obtained from the LFS. 

In this problem, we assume stable observed charac- 
teristics of the population. There are two reasons for this. 
First, LifePaths is a work in progress, and the benchmarking 
exercise we report on was carried out at a stage when other 
parts of the model that predict these characteristics were 
being extensively revised. In section 3.3, we touch on the 
consequences of evolving population characteristics. 
Second, we suppose that the primary reason for systematic 
change in observed outcomes between time periods is 
change in some factors not included in the measured 
characteristics of individuals. In the case of our application 
we observed a trend towards choice of maternity leave over 
leaving the labour force which seems to be due to social 
change rather than changes in the composition of the 
population. We also observed a change in the distribution 
of maternity leave durations that appears to be due to 
changes in the Unemployment Insurance (UI) program 
implemented in Bill C-21 in 1990. At that time Parental 
Benefits were introduced, which extended the period during 
which many mothers could receive benefits from 15 to 25 
weeks. Many mothers return to work at a time close to 
when they have exhausted UI benefits. 


3. BENCHMARKING METHODOLOGY 


In this section we present the method in an abstract form 
in order to clarify the assumptions, develop notation, and to 
reveal the similarity between the application to binary 
choice and to survival analysis. 


3.1 Application to Binary Choice 


The basic model for the benchmarking methodology 
relates to binary choice. Since we are not primarily 
interested in changes in the population, we simplify the 
analysis by assuming that the explanatory variables or 
individual characteristics in period t are represented by a series 
of independent identically distributed random vectors $X^t$. 
We recognise that this is quite a strong assumption. Never- 
theless, for the reasons discussed in section 2.3, we use it 
in our empirical work. Section 3.3 shows that it is a fairly 
simple matter to extend the theory to incorporate trends in 
the independent variables. 

Consider a linear predictor given by 


$$\eta^t(x) = \beta' x + \gamma^t \qquad (3.1)$$

where $\beta$ is a vector of coefficients constant over time, x is 
a possible outcome of $X^t$, and $\gamma^t$ represents a parameter 
specific to period t. Notice that x contains no “constant 
term.” Let $Y^t$ be a random variable, jointly distributed with $X^t$, 
that takes the value 1 if an event occurs and 0 if it does not. 
Suppose that the probability of the event, conditional on 
characteristics x, is given by 


$$E(Y^t \mid X^t = x) = \pi^t(x) = F(\eta^t(x)) \qquad (3.2)$$


where we require F to be a continuous distribution function. 
The values of the function will then be bounded by zero and 
one, and it will have an inverse g, so that 


$$\eta^t(x) = g(\pi^t(x)). \qquad (3.3)$$


In the context of generalised linear models, g is called a link 
function. We begin by finding maximum likelihood 
estimates of the base parameters $\beta$ and $\gamma^{t_0}$ using data for the 
time period $t_0$ (in our case this is the period when SLID 
data are available). Of course these data must include 
variables corresponding to outcomes of both $X^{t_0}$ and $Y^{t_0}$. 
It remains to estimate $\gamma^t$ for each period t. Equations (3.1) 
and (3.3) imply that 

$$E\{\eta^t(X^t) - \eta^{t_0}(X^{t_0})\} = \gamma^t - \gamma^{t_0} = E\{g(\pi^t(X^t))\} - E\{g(\pi^{t_0}(X^{t_0}))\}. \qquad (3.4)$$

Since we have observations only on the outcomes of $Y^t$ 
from the LFS for every period, we estimate the terms $\gamma^t$ by 

$$\hat{\gamma}^t = \hat{\gamma}^{t_0} + g(\hat{\pi}^t) - g(\hat{\pi}^{t_0}) \qquad (3.5)$$

where $\hat{\pi}^t$ is an estimate of $E(Y^t)$. Using the LFS, this 
estimate is the weighted frequency of the event in the time 
period t (taking each weight from the month where a child 
is first observed). To justify this procedure we use equation 
(3.4) and assume an approximation 

$$E\{g(\pi^t(X^t))\} - E\{g(\pi^{t_0}(X^{t_0}))\} \approx g(E\{\pi^t(X^t)\}) - g(E\{\pi^{t_0}(X^{t_0})\}). \qquad (3.6)$$

Inaccuracy will arise due to Jensen’s inequality in regions 
where g is convex or concave. Nevertheless, if g can be 
locally approximated by a linear function in the regions 
where $\pi^t(X^t)$ and $\pi^{t_0}(X^{t_0})$ are concentrated, then (3.6) 
may be quite accurate. The fact that g has an inflection 
point at 0.5 may aid the approximation when probabilities 
are dispersed around this value. 

Fortunately we are able to test the adequacy of the esti- 
mator by simulating the estimated model in LifePaths and 
comparing the predicted frequencies of the event with 
corresponding weighted frequencies observed in the data. 
The results indicate that it is quite adequate for our appli- 
cation. 
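
For the logit link, the adjustment of the period-specific constant amounts to shifting the base constant by a difference of log-odds. The sketch below assumes g is the logit function; the function names and the numerical values are hypothetical.

```python
import math


def logit(p: float) -> float:
    """Log-odds link g(p) = log(p / (1 - p)), defined for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def expit(eta: float) -> float:
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def benchmark_constant(gamma_base: float, pi_t: float, pi_base: float) -> float:
    """Shift the base-period constant by the change in log-odds.

    gamma_base : constant estimated from the base-period survey
    pi_t       : weighted event frequency in period t (secondary survey)
    pi_base    : weighted event frequency in the base period (secondary survey)
    """
    return gamma_base + logit(pi_t) - logit(pi_base)


# Hypothetical values: the event is less frequent in period t than in the base period.
gamma_t = benchmark_constant(gamma_base=-0.4, pi_t=0.45, pi_base=0.60)
print(gamma_t)
```

In a model with no covariates the adjustment is exact: if the base constant equals the logit of the base frequency, the benchmarked constant reproduces the period-t frequency. With covariates it is only as good as the approximation discussed above.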


3.2 Application to Survival Analysis 


We will show in section 5.2 that the approach outlined 
above can also be extended for use with a semiparametric 
survival model by adding an index $\tau$ representing the 
duration in the current state, so that (3.5) becomes 

$$\hat{\gamma}^t(\tau) = \hat{\gamma}^{t_0}(\tau) + g(\hat{\pi}^t(\tau)) - g(\hat{\pi}^{t_0}(\tau)) \qquad (3.7)$$

where $\hat{\pi}^t(\tau)$ represents the empirical hazard function. 


3.3 Trends in the Independent Variables


The benchmarking method may be improved by taking 
the changes in observed characteristics into account. As we 
noted in section 2.3, this would be considered when other 
parts of LifePaths are in a more mature form. To do this we 
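relax the assumption that the random vectors $X^{\tau}$ are
identically distributed. Equation (3.4) then becomes

$$E\{\eta^{\tau}(X^{\tau}) - \eta^{\tau_0}(X^{\tau_0})\} = \gamma^{\tau} - \gamma^{\tau_0} + \beta'\{E(X^{\tau}) - E(X^{\tau_0})\} = E\{g(\pi^{\tau}(X^{\tau}))\} - E\{g(\pi^{\tau_0}(X^{\tau_0}))\}. \tag{3.8}$$

Based on this, we might estimate $\gamma^{\tau}$ by

$$\hat{\gamma}^{\tau} = \hat{\gamma}^{\tau_0} + g(\hat{\pi}^{\tau}) - g(\hat{\pi}^{\tau_0}) - \hat{\beta}'(\bar{x}^{\tau} - \bar{x}^{\tau_0}) \tag{3.9}$$

where $\bar{x}^{\tau}$ is the vector of mean values of the characteristics
in period $\tau$. Of course it may not be possible to obtain all of
the mean values from the same data source. The method
would extend to the survival model case in the same manner
as (3.7) to give

$$\hat{\gamma}^{\tau}(t) = \hat{\gamma}^{\tau_0}(t) + g(\hat{\pi}^{\tau}(t)) - g(\hat{\pi}^{\tau_0}(t)) - \hat{\beta}'(\bar{x}^{\tau}(t) - \bar{x}^{\tau_0}(t)). \tag{3.10}$$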
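The adjustment for trends in covariate means can be sketched in a few lines. This is an illustrative sketch only, assuming a logit link; the function name and every numerical input below are hypothetical and serve only to show how the covariate term enters the benchmarked constant.

```python
import math

def logit(p):
    """Logit link, used here as an assumed choice of g."""
    return math.log(p / (1.0 - p))

def benchmark_with_covariate_trend(gamma_base, pi_period, pi_base,
                                   beta, xbar_period, xbar_base, g=logit):
    """Benchmarked constant in the form of (3.9).

    The part of the change in g(frequency) that is explained by the
    change in covariate means is removed from the constant term.
    """
    shift = sum(b * (xp - xb)
                for b, xp, xb in zip(beta, xbar_period, xbar_base))
    return gamma_base + g(pi_period) - g(pi_base) - shift

# Hypothetical inputs: two covariates whose means differ between periods.
gamma_adj = benchmark_with_covariate_trend(
    gamma_base=-1.0, pi_period=0.55, pi_base=0.70,
    beta=[0.094, -0.525], xbar_period=[4.0, 0.45], xbar_base=[5.0, 0.40])
print(gamma_adj)
```

If the covariate means are equal in the two periods the covariate term vanishes and the result coincides with the simpler estimator (3.5).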


4. MODELS AND THE ESTIMATION OF BASE 
PARAMETERS 


As explained in section 3.1, the base parameters $\beta$ and $\gamma^{\tau_0}$
are estimated by maximum likelihood using data from the
period $\tau_0$. We use data from SLID on all maternity leaves
beginning in the period 1993-1996 (our base period $\tau_0$).
We do not attempt to estimate annual changes in the
constant term $\gamma$ throughout this period.


4.1 The Binary Logit Model 


We adopt the logit model to represent a mother’s choice 
between taking a maternity leave and withdrawing from the 
labour force. From now on we adopt a more conventional 
econometrics notation and use a subscript i to index a 
random variable or outcome associated with an individual 
i. We suppose that a random variable $Y_i^{\tau}$ takes values 0 or
1, with $Y_i^{\tau} = 1$ indicating that new mother $i$ with vector of
characteristics $x_i$ in period $\tau$ chooses to take a maternity
leave, conditional on her having been employed, and that

$$\pi_i^{\tau} = P(Y_i^{\tau} = 1) = F(\eta_i^{\tau}) = \frac{\exp(\eta_i^{\tau})}{1 + \exp(\eta_i^{\tau})} \tag{4.1}$$

where $\eta_i^{\tau} = \beta' x_i + \gamma^{\tau}$ is the linear predictor of equation (3.1)
and F is the logistic distribution function. We estimate the
base parameters $\beta$ and $\gamma^{\tau_0}$ using N observations from SLID
by maximising the log-likelihood $\ln L(\beta, \gamma^{\tau_0})$ where

$$L(\beta, \gamma^{\tau_0}) = \prod_{i=1}^{N} [F(\eta_i^{\tau_0})]^{y_i} [1 - F(\eta_i^{\tau_0})]^{1 - y_i} \tag{4.2}$$

and

$$\ln L(\beta, \gamma^{\tau_0}) = \sum_{i=1}^{N} \left( y_i \ln F(\eta_i^{\tau_0}) + (1 - y_i) \ln[1 - F(\eta_i^{\tau_0})] \right). \tag{4.3}$$

Longitudinal SLID weights in the year of the child’s birth 
are scaled to sum to the sample size, and are then used to 
weight the terms of the log-likelihood and its derivatives. 
The weighted score equations are 


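$$\frac{\partial \ln L(\beta, \gamma^{\tau_0})}{\partial \beta} = \sum_i w_i x_i y_i - \sum_i w_i x_i F(\eta_i^{\tau_0}) = 0,$$

$$\frac{\partial \ln L(\beta, \gamma^{\tau_0})}{\partial \gamma^{\tau_0}} = \sum_i w_i y_i - \sum_i w_i F(\eta_i^{\tau_0}) = 0. \tag{4.4}$$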


The solution, which maximises the log-likelihood, was 
found by Newton-Raphson iteration. The logit model has 
been used often by statisticians and econometricians, and 
there is an extensive literature. For example, see Chambless 
and Boyle (1985), Roberts, Rao, and Kumar (1987), and 
Morel (1989). 
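A Newton-Raphson solution of weighted score equations of this kind can be sketched as follows. This is a minimal illustration, assuming a single covariate plus a constant so that the Newton step reduces to a 2 x 2 linear system; the function name and the toy data are hypothetical and are not the SLID estimation code.

```python
import math

def fit_weighted_logit(x, y, w, iterations=25):
    """Newton-Raphson for eta_i = gamma + beta * x_i with weights w_i."""
    n = len(y)
    scale = n / sum(w)                 # scale weights to sum to the sample size
    w = [wi * scale for wi in w]
    gamma, beta = 0.0, 0.0
    for _ in range(iterations):
        s0 = s1 = h00 = h01 = h11 = 0.0
        for xi, yi, wi in zip(x, y, w):
            p = 1.0 / (1.0 + math.exp(-(gamma + beta * xi)))
            r = wi * (yi - p)          # weighted score contribution
            v = wi * p * (1.0 - p)     # weighted information contribution
            s0 += r
            s1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system (information) * step = score.
        d0 = (h11 * s0 - h01 * s1) / det
        d1 = (h00 * s1 - h01 * s0) / det
        gamma += d0
        beta += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return gamma, beta

# Toy data: a binary covariate, with weights constant within each group.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 0, 1, 0, 1, 1, 1]
w = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]
gamma_hat, beta_hat = fit_weighted_logit(x, y, w)
print(gamma_hat, beta_hat)
```

With a binary covariate the solution can be checked by hand: the fitted probabilities must reproduce the weighted group frequencies, here 1/4 and 3/4.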


4.2 The Semiparametric Survival Model: 
Basic Form 


For mothers who have chosen to take a maternity leave 
from their job, we use a survival model to describe the dura- 
tion of their leave. The probability density function (pdf) of 
the distribution has a complex shape, as can be seen from 
the graphs in section 6.4. There is a spike at durations of less
than a month and a mode which appears to represent the 
maximum Unemployment Insurance special benefits 
entitlement available to mothers after 1990 (15 weeks of 
Maternity Benefits, plus 10 weeks of Parental Benefits, plus 
a two-week waiting period). We began the study by 
estimating various fully parametric models, including a 
log-logistic survival model combined with a logit model to 
predict durations of less than a month, but were unable to
obtain an adequate fit. To solve this problem, we follow 
Prentice and Gloeckler (1978), Han and Hausman (1986) 
and Meyer (1990), by nonparametrically estimating the 
effect of time on the hazard of returning to work. The 
hazard of returning to work is specified in a proportional 
hazards form: 
$$h_i^{\tau}(t) = \lambda_0^{\tau}(t) \exp\{\beta' x_i(t)\} \tag{4.5}$$

where $\lambda_0^{\tau}(t)$ is the unknown baseline hazard at leave
duration $t$ and time period $\tau$, $x_i(t)$ is a vector of explanatory
variables for mother $i$, and $\beta$ is a vector of coefficients.
The data tell us which of the intervals [0,1), [1,2), [2,3), ...
contains the spell duration (in our case the units are
months), and the model can be interpreted as an incompletely
observed continuous time hazard model with no restriction
on the form of the baseline hazard. If $T_i^{\tau}$ is the duration
of leave for mother $i$ during period $\tau$, then for $t = 1, 2, 3, \ldots$,
the probability that the spell lasts until time $t$, given that
it has lasted until $t - 1$, can be written as

$$P(T_i^{\tau} \ge t \mid T_i^{\tau} \ge t - 1) = \exp\left[-\int_{t-1}^{t} h_i^{\tau}(u)\,du\right] = \exp\left[-\exp\{\beta' x_i(t)\} \int_{t-1}^{t} \lambda_0^{\tau}(u)\,du\right] \tag{4.6}$$

if we assume that $x_i(t)$ is constant on the interval between
$t - 1$ and $t$. In order to apply the theory of section 3, we can
rewrite equation (4.6) as

$$P(T_i^{\tau} \ge t \mid T_i^{\tau} \ge t - 1) = \exp[-\exp\{\beta' x_i(t) + \gamma^{\tau}(t)\}] = \exp[-\exp\{\eta_i^{\tau}(t)\}] \tag{4.7}$$

where

$$\gamma^{\tau}(t) = \ln\left\{\int_{t-1}^{t} \lambda_0^{\tau}(u)\,du\right\}. \tag{4.8}$$
One may censor any ongoing observations at some large
duration $T$. Again we can estimate the base parameters $\beta$
and $\gamma^{\tau_0}$ using N observations from SLID by maximising the
log-likelihood $\ln L(\gamma^{\tau_0}, \beta)$. Since we will always be
referring to data from the base period for the remainder of
section 4, we drop superscripts $\tau_0$.

The likelihood function is given by

$$L(\gamma, \beta) = \prod_{i=1}^{N} \left[1 - \exp\{-\exp(\eta_i(k_i))\}\right]^{\delta_i} \prod_{t=1}^{k_i - 1} \exp\{-\exp(\eta_i(t))\} \tag{4.9}$$

where $\gamma = [\gamma(1), \gamma(2), \ldots, \gamma(T)]'$, $C_i$ is a censoring time,
$\delta_i = 1$ if $T_i < C_i$ and 0 otherwise, and $k_i = \min(\mathrm{int}(T_i), C_i)$. The
log-likelihood is therefore

$$\ln L(\gamma, \beta) = \sum_{i=1}^{N} \left[ \delta_i \ln[1 - \exp\{-\exp(\eta_i(k_i))\}] - \sum_{t=1}^{k_i - 1} \exp(\eta_i(t)) \right]. \tag{4.10}$$

Weights from the months that a child is first observed are 
scaled to sum to the sample size, and then used to weight 
the terms of the log-likelihood function and its derivatives. 
The weighted log-likelihood function is maximised by the 
quasi-Newton algorithm of Broyden, Fletcher, Goldfarb, 
and Shanno (BFGS), using an implementation based on 
Dennis and Schnabel (1983). 
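The structure of a grouped-duration log-likelihood of this type can be made concrete with a short sketch. The code below is an illustration under stated assumptions, not the estimation program used for the paper: it takes the convention of Meyer (1990), in which a spell with index k contributes survival through intervals 1 to k - 1 and, if completed, an exit in interval k; covariates are held fixed over the spell; and no optimiser is included. The data are made up.

```python
import math

def log_likelihood(gamma, beta, spells):
    """Weighted log-likelihood for a grouped proportional hazards model.

    gamma  : list with gamma[t - 1] the baseline term for interval t
    beta   : list of regression coefficients
    spells : list of (k, delta, x, w) tuples, where k is the interval
             index, delta is 1 for a completed spell and 0 for a
             censored one, x is a covariate list held fixed over the
             spell (a simplification), and w is a scaled weight
    """
    total = 0.0
    for k, delta, x, w in spells:
        bx = sum(b * xi for b, xi in zip(beta, x))
        # Survival through intervals 1, ..., k - 1.
        contribution = -sum(math.exp(bx + gamma[t - 1]) for t in range(1, k))
        if delta:
            # The spell ends in interval k.
            contribution += math.log(
                1.0 - math.exp(-math.exp(bx + gamma[k - 1])))
        total += w * contribution
    return total

# Made-up data without covariates: 3 spells end in interval 1,
# 2 end in interval 2, and 5 are censored with k = 3.
spells = [(1, 1, [], 1.0)] * 3 + [(2, 1, [], 1.0)] * 2 + [(3, 0, [], 1.0)] * 5
# Without covariates the maximising baseline terms are ln(-ln(1 - d/r)),
# where d/r is the exit frequency in each interval (3/10 and 2/7 here).
gamma_hat = [math.log(-math.log(1 - 3 / 10)), math.log(-math.log(1 - 2 / 7))]
print(log_likelihood(gamma_hat, [], spells))
```

In practice a function like this would be handed to a quasi-Newton optimiser; the closed-form check above is only a way of confirming that the pieces are assembled correctly.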


4.3 The Semiparametric Survival Model: 
with Work-to-Birth Gap Decision 


The situation in our application is complicated somewhat 
by our desire to model the duration from leaving the job 
until the birth (the work-to-birth gap), as well as the hazard 
of returning to work from a maternity leave. The model of 
work-to-birth gap is estimated separately, based on SLID 
data. Examination of the mean gap duration for each year 
in the LFS data indicates that this duration has been fairly 
stable over time, so the model is not benchmarked. Never- 
theless, a modification of the semiparametric survival 
model is necessary to incorporate the separate model of 
work-to-birth gap. This can be accomplished by assuming 
that the work-to-birth gap decision, possibly involving 
health considerations, acts to constrain the desired total 
duration. This means that the above model would apply to 
the desired total duration, which is unobservable, and might 
be labelled T*. 

In cases where the desired duration was shorter than the 
work-to-birth gap, the mother might return to work as soon 
as possible after the birth. This means that in cases where 
we observe a significant work-to-birth gap (greater than a 
month), and the mother returns soon after birth (within a 
month), all that is known about desired duration is that 


$$T^{*} \le T$$

where $T$ is the total duration of leave. This is equivalent to
a situation labelled “left censoring” by Cox and Oakes
(1984, page 178), where observation does not start immediately
and some individuals have already failed before it
does.

From such an observation we get a contribution to the
likelihood function and its logarithm given by

$$L_i = 1 - P(T_i^{*} \ge k_i) = 1 - \prod_{t=1}^{k_i} \exp\{-\exp(\eta_i(t))\} \tag{4.11}$$

and

$$\ln(L_i) = \ln\left\{1 - \exp\left[-\sum_{t=1}^{k_i} \exp(\eta_i(t))\right]\right\}. \tag{4.12}$$


Unfortunately the log-likelihood expression does not 
simplify like the corresponding expression for “right- 
censored” observations. In spite of this, Monte Carlo 
experiments indicate that estimation is not a problem even 
in heavily censored data sets. 
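The left-censored contribution is straightforward to evaluate even though it does not collapse to a simple sum. A minimal sketch follows, assuming the linear predictors for intervals 1 to k are already available as a list; the function name is hypothetical.

```python
import math

def left_censored_log_contribution(eta):
    """Log of one minus the probability of surviving all listed intervals.

    eta : list of linear predictors, one per interval 1, ..., k
    """
    cumulative = sum(math.exp(e) for e in eta)
    # -expm1(-c) equals 1 - exp(-c) and stays accurate when c is tiny.
    return math.log(-math.expm1(-cumulative))

# Interval survival probabilities 0.7 and 5/7 give overall survival 0.5,
# so the left-censored contribution should be ln(1 - 0.5).
eta = [math.log(-math.log(0.7)), math.log(-math.log(5 / 7))]
print(left_censored_log_contribution(eta))
```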

Longitudinal SLID weights in the year of the child’s birth
are used in the same manner as for the basic form of the
survival model.


5. BENCHMARKING THE MODELS 
To begin the benchmarking procedure we must invert the 
distribution function F given in equation (3.2) to find the 
link function g. We then apply equation (3.5) in the case of 
the logit model, and equation (3.7) in the case of the 
survival model. 


5.1 Application to the Binary Logit Model 


To benchmark the logit model we first invert the logistic 
distribution function in equation (4.1) to obtain 


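$$\eta_i = g(\pi_i) = \ln\left(\frac{\pi_i}{1 - \pi_i}\right) \tag{5.1}$$

where g is the well-known logit function. We can then
apply equations (3.5) and (5.1) to obtain

$$\hat{\gamma}^{\tau} = \hat{\gamma}^{\tau_0} + g(\hat{\pi}^{\tau}) - g(\hat{\pi}^{\tau_0}) = \hat{\gamma}^{\tau_0} + \ln\left[\frac{\hat{\pi}^{\tau}/(1 - \hat{\pi}^{\tau})}{\hat{\pi}^{\tau_0}/(1 - \hat{\pi}^{\tau_0})}\right] \tag{5.2}$$

where for $\tau < \tau_0$, each $\hat{\pi}^{\tau}$ is the frequency of choosing
maternity leave calculated from LFS data for maternity
leaves beginning in year $\tau$, and $\hat{\pi}^{\tau_0}$ is the frequency from
SLID data.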
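As a worked illustration of the logit benchmarking formula, suppose (hypothetically) that the frequency of choosing a leave is 0.80 in the base period and 0.60 in some earlier year. These two frequencies are invented for the arithmetic; only the base constant of -6.432 is taken from Table 1. The benchmarked constant is the base constant plus the log of the ratio of the two odds:

```latex
\hat{\gamma}^{\tau}
  = -6.432 + \ln\left[\frac{0.60/0.40}{0.80/0.20}\right]
  = -6.432 + \ln(0.375)
  \approx -6.432 - 0.981
  = -7.413
```

A lower observed frequency therefore lowers the constant term, and the slope coefficients are left unchanged.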


5.2 Extension to the Survival Model 
From equation (4.7) we get

$$\pi_i^{\tau}(t) = 1 - \exp[-\exp\{\eta_i^{\tau}(t)\}] = F\{\eta_i^{\tau}(t)\} \tag{5.3}$$

where

$$\eta_i^{\tau}(t) = \beta' x_i(t) + \gamma^{\tau}(t). \tag{5.4}$$

In this case F is an extreme value distribution that is easily
inverted to obtain

$$\eta_i^{\tau}(t) = \ln[-\ln(1 - \pi_i^{\tau}(t))] = g(\pi_i^{\tau}(t)). \tag{5.5}$$

For benchmarking we can use equation (3.7) with the
observed frequencies in period $\tau$ represented by the
empirical hazard or occurrence/exposure ratio given by

$$\hat{\pi}^{\tau}(t) = d^{\tau}(t) / r^{\tau}(t) \tag{5.6}$$

where, for spells beginning in period $\tau$, $d^{\tau}(t)$ is the number
of mothers who fail in the interval $(t - 1, t]$ and $r^{\tau}(t)$ is the
number of mothers in view at duration $t$, including those
censored at time $t$ (censoring can only occur at the end of
intervals). Numbers of mothers were calculated from
sample counts by applying the LFS weight from the month
that a new mother returns to work. The empirical hazard
and the corresponding estimator for the survivor function
implied by the product law of probabilities were studied by
Kaplan and Meier (1958). The use of the empirical hazard
in equation (3.7) together with equation (5.5) yields

$$\hat{\gamma}^{\tau}(t) = \hat{\gamma}^{\tau_0}(t) + \ln\left\{\frac{\ln[1 - \hat{\pi}^{\tau}(t)]}{\ln[1 - \hat{\pi}^{\tau_0}(t)]}\right\}. \tag{5.7}$$
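The benchmarking of the baseline terms can be sketched as below. This is an illustrative sketch with made-up weighted counts; the function names are hypothetical, and the base constants are set equal to the link of the base hazards so that the example corresponds to a model without covariates.

```python
import math

def cloglog(p):
    """Link g: inverse of F(eta) = 1 - exp(-exp(eta))."""
    return math.log(-math.log(1.0 - p))

def inverse_cloglog(eta):
    """Extreme value distribution function F."""
    return 1.0 - math.exp(-math.exp(eta))

def empirical_hazard(failures, at_risk):
    """Occurrence/exposure ratio for each duration interval."""
    return [d / r for d, r in zip(failures, at_risk)]

def benchmark_baseline(gamma_base, hazard_base, hazard_period):
    """Shift every baseline term by the change in g(empirical hazard)."""
    return [g0 + cloglog(hp) - cloglog(hb)
            for g0, hb, hp in zip(gamma_base, hazard_base, hazard_period)]

# Made-up weighted counts for three duration intervals.
hazard_base = empirical_hazard([30.0, 5.0, 10.0], [100.0, 70.0, 65.0])
hazard_period = empirical_hazard([20.0, 4.0, 6.0], [100.0, 80.0, 76.0])
gamma_base = [cloglog(h) for h in hazard_base]   # covariate-free base model
gamma_period = benchmark_baseline(gamma_base, hazard_base, hazard_period)
print([inverse_cloglog(g) for g in gamma_period])
```

In the covariate-free case the benchmarked terms reproduce the period hazards exactly; with covariates the match is approximate, in the same way as for the logit model.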


6. EMPIRICAL RESULTS 


The results of estimation in the base period, and the 
results of simulation with benchmarked parameter estimates 
are presented for both models. The simulation results are 
compared with annual survey sample frequencies of 
choosing a maternity leave in the case of the logit model, 
and with annual survey frequency distributions of maternity 
leave duration in the case of the survival model. 


6.1 Estimation Results for the Binary Logit Model 


The results obtained from estimating the logit
model from SLID data are presented in Table 1. Omitted
dummy variable categories, which form the reference cate- 
gories for the variables used in the model, were province of 
residence Ontario and highest education level “some post 
secondary.” Individual and family income variables were 
tested, but were found not to be significant, and so were not 
included in the regression. 

There may be some bias in the estimates, particularly 
those of the standard errors, due to the fact that the complex 
SLID sample design was accounted for only through the 
weights applied to the log-likelihood. 

The significant positive effect of job tenure seems 
reasonable for a number of reasons. A lengthy tenure might 
indicate that the woman has acquired firm-specific human 
capital and has achieved some seniority. It would also be an 
indicator of strong attachment to the labour force generally. 
On the firm side, the longer the woman’s job tenure, the 
longer the leave that the firm is likely to grant with a 
guarantee that she can return to her job. Provincial
government guarantees of job security also depend on job
tenure. Finally, a lengthy job tenure means that the woman
will likely meet the Unemployment Insurance eligibility
requirements (20 weeks of insured employment). A dummy
variable indicating that UI entrance requirements were met
was tested and found to be just significant at the 5% level.
However, because we are not able at this stage to model
changes in the UI program through the influence of
covariates, because of uncertainty in interpretation, and
because of high correlation with job tenure, it was not
included. In the LFS, self-employed workers are reported
as having a transition out of employment only when they
terminate their business. Since taking a leave simply means
not terminating the business, a significant positive effect for
the indicator of self-employment is to be expected. Having
been self-employed before the birth multiplies the odds of
taking a maternity leave by 3.33 (an increase of about 233%),
the strongest effect that we see for an indicator variable.


Table 1
Binary Logit Parameter Estimation Results

Parameter               Estimate of   Contribution     Std Error of   Prob-
                        Coefficient   to Odds Ratio*   Coefficient    Value

Constant                  -6.432         0.002            2.995       0.0318
NFLD                      -0.829         0.436            0.741       0.2636
PEI                        0.931         2.537            1.612       0.5633
NS                        -0.456         0.634            0.541       0.3992
NB                         0.207         1.230            0.675       0.7596
QUE                       -0.361         0.697            0.247       0.1437
MAN                       -0.490         0.613            0.503       0.3306
SASK                      -0.163         0.850            0.458       0.7218
ALTA                      -0.200         0.819            0.325       0.5379
BC                        -0.120         0.887            0.300       0.6899
Job Tenure (mths)/10       0.094         1.099            0.026       0.0003
Self-employed?             1.203         3.330            0.418       0.0040
Age (Years)                0.479         1.614            0.199       0.0160
(Age^2)/10                -0.071         0.931            0.033       0.0296
< High School Grad        -0.702         0.496            0.357       0.0490
High School Grad          -0.148         0.862            0.276       0.5913
University Grad           -0.292         0.747            0.229       0.2027
First Child?              -0.525         0.592            0.192       0.0063

log-likelihood = -381.553

Number of Observations = 835

Observations are given the SLID longitudinal weight from the year of
birth, scaled to sum to the sample size.

* This is the exponential of the coefficient. It may be interpreted as
the proportional change in the odds ratio due to a unit change in the
corresponding independent variable.
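The "Contribution to Odds Ratio" column can be reproduced directly from the coefficients. The short sketch below does this for three rows of Table 1 and also expresses each contribution as a percentage change in the odds; the dictionary and variable names are illustrative.

```python
import math

# Coefficients copied from Table 1.
coefficients = {
    "Self-employed?": 1.203,
    "First Child?": -0.525,
    "Job Tenure (mths)/10": 0.094,
}

# The odds ratio contribution is the exponential of the coefficient.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

# Percentage change in the odds for a unit change in the variable.
percent_change = {name: 100.0 * (ratio - 1.0)
                  for name, ratio in odds_ratios.items()}

for name in coefficients:
    print(name, round(odds_ratios[name], 3), round(percent_change[name], 1))
```

An odds ratio of 3.33 corresponds to odds that are 3.33 times as large, which is an increase of roughly 233%, while a ratio of 0.592 corresponds to a reduction of roughly 41%.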


The effect of the first child indicator also seems reasonable.
The odds of maternity leave for a first-time mother
are only 59% of the odds for a mother of
more than one child, given that all other characteristics are
the same; i.e., first-time mothers are more inclined to job
separation than mothers who already have children.
This may be partly a consequence of the fact that our
sample consists of mothers who have been employed within
4 months of the birth. Mothers who have more than one
child tend to space them within a few years at most. If they
are employed just before a second or subsequent birth,
they will have already demonstrated that they returned to
work after an absence that must have been less than the gap
between births. This at least rules out some common
patterns of withdrawal from the labour force, for example
staying at home until all children are in school.
The effect of age is more difficult to interpret since the
effect on the log-odds ratio is non-linear. By drawing a
graph of the term $0.479 \times \mathrm{age} - 0.0071 \times \mathrm{age}^2$ one can see that,
as age increases, the log-odds of taking a maternity leave
first increases, but that the rate of increase declines until a
level point at the maximum log-odds is reached by the age
of 34. Since the number of mothers declines considerably
after this age, the subsequent decline may not be 
meaningful. One might hazard a conjecture that, among 
young mothers, being relatively older indicates more 
attachment to the labour force and thus a stronger tendency 
to take a maternity leave, while among older mothers, who 
are past the stage of first entering the labour force, this 
effect is reduced. However, the results are probably not 
precise enough to draw any firm conclusion about this. 
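The age at which the log-odds level off can be checked from the two age coefficients in Table 1 (0.479 for age and -0.071 for age squared divided by 10, i.e. -0.0071 per unit of age squared). Setting the derivative of the quadratic term to zero gives:

```latex
\frac{d}{d\,\mathrm{age}}\left(0.479\,\mathrm{age} - 0.0071\,\mathrm{age}^{2}\right)
  = 0.479 - 0.0142\,\mathrm{age} = 0
\quad\Longrightarrow\quad
\mathrm{age}^{*} = \frac{0.479}{0.0142} \approx 33.7
```

This agrees with the turning point of about 34 years quoted in the text.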


6.2 Simulation Results for the Benchmarked Binary 
Logit Model 


The benchmarking exercise consists of adjusting the 
constant term of the model in the manner described by (5.2) 
for each year in the period 1975-1992. The constant term is 
not adjusted after 1992, partly because the LFS data do not 
indicate a strong trend after 1992. The model is then 
incorporated in LifePaths and a simulation is run. For each 
year from 1976 to 1995, Figure 1 shows both the frequency 
of choosing a leave in the LifePaths simulation, and the 
frequency estimated from the LFS. For the period 
1993-1995, estimates from SLID are also presented. 


[Figure 1: line chart not reproduced. Vertical axis: Frequency;
horizontal axis: year. Series: LFS, LFS (3-yr MA), SLID, LifePaths.]

Figure 1. Frequency of Choosing a Maternity Leave 1976-1996


The simulation captures the change over time revealed 
by the LFS data during the period 1976-1992. There is no 
benchmark adjustment implemented in the LifePaths simu- 
lation after 1992, so that the base parameters estimated from 
pooled SLID data 1993-1996 are effective. The simulated 
frequency is slightly lower than the observed SLID 
frequency during this period. Two possible sources of error 
are an insufficiently flexible specification of the binary 
choice model, and differences between the SLID estimates 
of explanatory variables and those provided by LifePaths. 


6.3 Estimation Results for the Survival Model 


The results obtained from estimating the semiparametric 
survival model from SLID data are presented in Table 2. 
As in the binary logit model estimation, omitted dummy 
variable categories were province of residence Ontario and 
highest education level “some post secondary.” Since the 
dependent variable is the hazard of returning to work, a 
positive coefficient for a covariate indicates an influence 
that tends to shorten the duration of maternity leave. 

The estimates of the constant terms in the duration-dependent
linear predictor given by (4.7) are denoted in
Table 2 by GAMMAi, i = 1, 2, ..., 15. These represent the
influence of the baseline hazard incorporating the influence
of duration.


Table 2
Survival Model Parameter Estimation Results

Parameter                Estimate    Std Error    Prob-Value

Job Tenure (mths)/10      -0.030       0.010        0.0024
NFLD                       0.195       0.426        0.6470
PEI                        0.307       0.490        0.5313
NS                         0.173       0.258        0.4940
NB                         0.109       0.293        0.7091
QUE                        0.111       0.117        0.3411
MAN                       -0.402       0.253        0.1116
SASK                      -0.303       0.213        0.1539
ALTA                       0.270       0.154        0.0798
BC                        -0.440       0.148        0.0030
Self-Employed?             1.665       0.157        0.0000
Age                       -0.253       0.041        0.0000
(Age^2)/10                 0.043       0.007        0.0000
First Child?              -0.301       0.090        0.0009
< High School Grad         0.508       0.206        0.0135
High School Grad          -0.124       0.125        0.3212
University Grad           -0.374       0.108        0.0006
Employed Spouse?           0.109       0.151        0.4703
Gamma1                     2.570       0.609        0.0000
Gamma2                    -1.136       0.816        0.1636
Gamma3                    -0.466       0.719        0.5176
Gamma4                     0.780       0.640        0.2232
Gamma5                     1.425       0.627        0.0231
Gamma6                  [illegible]    0.613        0.0000
Gamma7                     3.640       0.612        0.0000
Gamma8                     3.413       0.620        0.0000
Gamma9                     3.465       0.630        0.0000
Gamma10                    3.387       0.649        0.0000
Gamma11                    4.579       0.655        0.0000
Gamma12                    4.285       0.785        0.0000
Gamma13                    3.645       1.110        0.0010
Gamma14                    3.746       1.281        0.0034
Gamma15                    6.215       2.415        0.0101


log-likelihood = -1165.06 
Number of Observations 341 1 


Observations are given the SLID longitudinal weight from the year of
birth, scaled to sum to the sample size.
Again, individual and family income variables were
tested and found not to be significant. Both this finding and
the importance of a self-employment indicator as a predictor
of early return to work accord with the findings of
Marshall (1999). Marshall found that education variables
were not significant in determining whether a mother would
return to work within a month. We find, however, that
university graduation has a significant negative effect on 
the hazard (positive effect on duration). Job tenure has a 
significant negative effect on the hazard, possibly reflecting 
its relationship with Unemployment Insurance entitlement 
and job security. 
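Because the model is of proportional hazards form, a coefficient in Table 2 acts multiplicatively on the hazard, and the probability of remaining on leave through an interval is the baseline probability raised to the power exp(coefficient). The sketch below illustrates this with two coefficients from Table 2; the baseline interval survival probability of 0.9 is a made-up value chosen only for illustration.

```python
import math

def hazard_ratio(coefficient):
    """Multiplicative effect of a unit change in a covariate on the hazard."""
    return math.exp(coefficient)

def interval_survival(baseline_survival, coefficient):
    """Probability of remaining on leave through one interval.

    Follows from exp[-exp(b + gamma)] = (exp[-exp(gamma)]) ** exp(b).
    """
    return baseline_survival ** math.exp(coefficient)

baseline = 0.9                                   # hypothetical value
self_employed = interval_survival(baseline, 1.665)
university = interval_survival(baseline, -0.374)
print(hazard_ratio(1.665), self_employed)
print(hazard_ratio(-0.374), university)
```

A positive coefficient lowers the probability of staying on leave (shorter durations), and a negative coefficient raises it (longer durations).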


6.4 Simulation Results for the Benchmarked 
Survival Model 


In the case of the semiparametric survival model, benchmarking
consists of adjusting all of the terms GAMMAi,
i = 1, 2, ..., 15 of the previous section according to (5.7) for
each of the years in the period 1975-1992. The model is
then simulated as part of LifePaths.

The frequency distribution of simulated maternity leave 
durations is presented and compared to the corresponding 
observed frequency distribution from LFS data. In order to 
present the results, the frequencies in 3-year periods were 
averaged. A key feature of the frequency distribution is an 
abrupt change apparently due to the introduction of parental 
benefits with Bill C-21 at the end of 1990. Since mothers 
with maternity claims in progress at the time of implemen- 
tation were entitled to parental benefits, the claims 
beginning in 1990 represent a mixture of regimes. For this 
reason the year 1990 is not included in any of the 3-year 
averages. In Figures 2 and 3 we use disjoint 3-year periods 
covering 1976-1984. To balance periods before and after 
1990 using available data, in Figures 4 and 5 we use the 
overlapping periods 1985-1987, 1987-1989, 1991-1993, 
and 1993-1995. 
Figure 2. LifePaths: Distribution of Leave Durations for 1976-1984

Figure 3. LFS Data: Distribution of Leave Durations for 1976-1984

Figure 4. LifePaths: Distribution of Leave Durations for 1985-1989
and 1991-1995

The distribution of durations derived from SLID data
1993-1996 is presented in Figure 6. This may be compared
with the simulated data shown in Figure 4 for the period
1993-1995, since no benchmarking is applied after 1992.

Figure 6. SLID Data: Distribution of Leave Durations for
1993-1996


In Figure 7 we present the average duration of maternity
leaves beginning in each year of the observed period. The
averages of simulated durations are compared with those
from the surveys.


[Figure 7: line chart not reproduced. Vertical axis: Months;
horizontal axis: year. Series: LifePaths, LFS, SLID.]

Figure 7. Average Duration of Maternity Leave 1976-1996


6.5 Evaluation of Benchmarking Performance 


The benchmarking method appears to be very effective 
in the case of the binary logit model. The trend of the LFS 
data is well reflected in the LifePaths simulation. In the case 
of the survival model, the key feature of the LFS data is the 
abrupt shift of the mode of the frequency distribution after 
1990, apparently due to the introduction of parental 
benefits. This shift has been captured by the simulated data. 
Also the average duration of maternity leave in the 
simulation fits the LFS data very closely. 

A noticeable divergence between the simulation and the
LFS data is the height of the mode at the interval (3, 4]
months in the frequency distribution of the durations from
LifePaths for 1982-1989. This may be due to the effect of
trends in the values of explanatory variables, which we have
assumed to be stable. Further work is necessary to establish
this. A possible extension to the model was discussed in
section 3.3.


7. CONCLUSIONS 


The technique that we have developed appears to be 
quite successful in benchmarking of the logit and survival 
model parameters so that the essential features of the LFS 
data are captured in LifePaths predictions. The key to
benchmarking the logit model is the adjustment of the
parameter corresponding to the “constant term” in the linear
predictor that is embedded in the logistic distribution
function in order to predict the conditional expectation of
the dependent variable. Section 3.1 develops the technique
in a general framework that includes other models of binary 
choice. Particularly, it would extend to the popular probit 
model where a linear predictor is embedded in the standard 
normal distribution function. Benchmarking of the semi- 
parametric survival model hinges on the adjustment of all 
the parameters representing the baseline hazard. Our results 
illustrate how the entire shape of the distribution of dura- 
tions predicted by the model can be made to evolve through 
time according to a pattern revealed by supplementary data. 
Improved Ratio Estimation in Telephone Surveys Adjusting
for Noncoverage


STEVEN T. GARREN and TED C. CHANG¹


ABSTRACT 


Since some individuals in a population may lack phones, telephone surveys using random digit dialing within strata may 
result in asymptotically biased estimators of ratios. The impact from not being able to sample the nonphone population is 
examined. We take into account the propensity that a household owns a phone, when proposing a post-stratified phone- 
weighted estimator, which seems to perform better than the typical post-stratified estimator in terms of mean squared error. 
Such coverage propensities are estimated using the Public Use Microdata Samples, as provided by the United States Census. 
Non-post-stratified estimators are considered when sample sizes are small. The asymptotic mean squared error, along with 
its estimate based on a sample, of each of the estimators is derived. Real examples are analyzed using the Public Use 
Microdata Samples. Other forms of nonresponse are not examined herein. 


KEY WORDS: Asymptotics; Census Public Use Microdata Samples; Post-stratification; Telephone survey. 


1. INTRODUCTION 


Consider surveys where the telephone population is 
sampled. Major problems in telephone surveys include 
nonresponse (i.e., refusal to participate in the survey) and 
noncoverage (i.e., lacking telephone service). Nonresponse 
may cause larger bias than noncoverage, since nonresponse 
propensities are usually much higher than noncoverage 
propensities. However, nonresponse is reviewed rather 
briefly, because the focus of this article is noncoverage. 


1.1 Literature Review 


Khurshid and Sahai (1995) provided an extensive 
bibliography of papers on telephone surveys. Examples of 
nonresponse rates may be found in Steeh, Groves, Com- 
ment and Hansmire (1983, pages 189-197). Corrections for 
nonresponse, using weights and imputation, were discussed 
by Little (1986) and Rubin (1987). Rao (1997) provided an
overview of sample surveys, including discussions on
resampling methods, especially the jackknife, for variance 
estimation. His discussion includes techniques to estimate 
the variance in the presence of imputation. 

Regarding noncoverage, Brick, Waksberg and Keeter
(1994) found that 94% of the households in the United
States have phones at any given time. They also found that
the households with interrupted telephone service usually
are indigent. Keeter (1995) reported that in a survey
conducted from 1992 to 1993 more than half of all households
without continuous telephone service during that year
were transient, i.e., these transient households were both
with and without telephone service at different times during
that year. He also found that most socioeconomic factors


(excluding home ownership) for transient telephone house- 
holds are similar to those factors for households which are 
continuously without phones. These similarities between 
the transient and the nonphone populations suggest that 
valid inferences may be made on the entire (phone, non- 
phone, and transient) population, based on telephone 
surveys. Thornberry and Massey (1988) examined non- 
coverage for various socio-demographic groups from 1963 
to 1986, and found income to be the most important factor 
in determining the likelihood that a household has a phone. 


1.2 Our Approach


Given several characteristics, such as home
ownership and household language, the propensity of a 
household to have phone service is estimated in this article 
using the Virginia portion of the 1990 Census Public Use 
Microdata Samples (PUMS), which represent 5% of the 
population. Whether or not households have phones is 
included in the PUMS. The estimation of these propensities,
or probabilities of phone service, is based on generalized
linear regression with a log-log link, since the logit
link provides a poor fit. We advocate using our fitted
regression model, with the estimated parameters, for esti- 
mating these likelihoods in general whenever a random 
sample is taken from the Virginian phone population. 

Because it is such a huge data set, the PUMS have 
another useful purpose in this article. The PUMS are used 
to compare and contrast estimators in terms of bias and 
variance, by examining the entire phone population and by 
taking repeated samples of the phone population. Cate- 
gorical data consisting of 75 household and 75 personal 
variables are listed for all individuals in the households 
selected to be in the PUMS. 
In the examples in section 6 high school graduation rate,
mean number of cars per household, and mean household 
income are estimated using both post-stratified and non- 
post-stratified estimators for samples of size 500 from the 
PUMS. The post-stratification variables for high school 
graduation rate are gender, age, and race of the head of 
household. The post-stratification variable for mean number 
of cars per household is household income only. Estimators 
of the mean household income are analysed twice. For one 
analysis, post-stratification is on only the race of the head 
of household. For the other analysis, post-stratification is on 
gender, age, and race of the head of household. Each of 
these post-stratification variables is divided into two cate- 
gories, except income, which is divided among three 
categories. 

A serious drawback to estimators not taking into account 
the propensities of phone service is that these estimators are 
not asymptotically unbiased as the sample size gets large. 
A major focus of this article is to show that bias is reduced 
substantially when the estimators take into account the pro- 
pensities of phone service, as estimated by the PUMS. 
Since both post-stratified and non-post-stratified estimators,
each with and without the propensities of phone service,
are considered, four estimators are examined herein. In
particular, these four estimators of a population mean are
the sample mean, the usual post-stratified estimator, a
phone-weighted estimator, and a proposed post-stratified
phone-weighted estimator. The mean squared errors (MSE)
of the phone-weighted estimator and the post-stratified
phone-weighted estimator go to zero as the sample size gets
large, unlike those of the other two estimators.

We adopt a two-phase model for our four estimators. 
The first phase involves selection from the entire population 
into the phone population. We treat the propensity of a 
household to have phone service as the probability that the 
household will be selected into the phone population, and 
we assume that this probability is positive (although 
possibly small) for each household. The second phase is a 
stratified (perhaps geographically stratified) simple random 
sample from the phone population. In the examples in 
section 6, we consider post-stratification by characteristics 
such as race and age of the head of household. Since our 
sample sizes are small, we do not geographically stratify the 
population of Virginia, although our formulas allow for 
both stratification and post-stratification. 

Ideally, one would post-stratify using the same covariates 
used for estimating the propensities of phone service in the 
first phase of our model. In this case, the three estimators 
which use the propensities of phone service and/or post- 
stratification will be almost identical. However, the sample 
size for each post-stratified category should not be too 
small, so practical limitations restrict the number of cate- 
gories which should be used for post-stratification. Never- 
theless, many categories may be used for constructing the 
propensities of phone service from the PUMS, because the 
entire population is used. 


Even if post-stratification by many covariates is feasible, 
the usual variance formulas for post-stratification require 
that a stratified random sample be taken from the entire 
population. In our situation, however, a stratified random 
sample is taken from the phone population, so the usual 
variance formulas are not applicable to our situation. The 
techniques by Politz and Simmons (cf. Cochran 1977, pages 
374-377) require the sampling frame to be the entire 
population, not just the phone population, and hence are not 
applicable to our scenario, which allows noncoverage. 

We derive the asymptotic variances of the four estima- 
tors of a population ratio, and we determine reasonable esti- 
mates of these variances. Since a population mean is a 
special case of a ratio, and a population total is a multiple of 
a ratio, then the results regarding estimators of means or 
totals follow from the results regarding estimators of ratios. 


2. NOTATION 


Consider N households in a population, U. For each household in U, let
two variables of interest be denoted by \(y_{1k}\) and \(y_{2k}\), for
\(k \in U\). At any given time, the event that the kth household does or
does not have a phone is treated as random, while \(y_{ik}\) is treated
as fixed.

Letting

\[ \alpha_{(i)} = N^{-1} \sum_{k \in U} y_{ik} \]

for i = 1, 2, the goal is to estimate \(\alpha_{(1)}\), \(\alpha_{(2)}\),
and the ratio \(\mu = \alpha_{(1)} / \alpha_{(2)}\). Without loss of
generality we concentrate on estimating \(\alpha_{(1)}\) and \(\mu\).

An important special case of estimating a ratio \(\mu\) arises when one
desires to estimate the mean of a variable \(z_k\), for \(k \in D\), for
some subpopulation \(D \subset U\), but one cannot sample directly from D.
Examples include subpopulations defined by race. Let \(x_k\) be 1 if
\(k \in D\) and 0 otherwise. Let \(y_{1k} = z_k x_k\) and
\(y_{2k} = x_k\). Then \(\mu\) is the population mean of \(z_k\) over the
subpopulation D.
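As a small illustration of this special case, the sketch below builds \(y_{1k} = z_k x_k\) and \(y_{2k} = x_k\) from made-up values and checks that the ratio of the two population means equals the mean of z over D. The data and the function name `domain_mean_as_ratio` are hypothetical.

```python
def domain_mean_as_ratio(z, x):
    """Return mu = alpha_1 / alpha_2 with y1 = z * x and y2 = x.

    z: values of the study variable for every unit in U
    x: 1 if the unit belongs to the subpopulation D, else 0
    """
    n_units = len(z)
    y1 = [zk * xk for zk, xk in zip(z, x)]
    y2 = list(x)
    alpha_1 = sum(y1) / n_units
    alpha_2 = sum(y2) / n_units
    return alpha_1 / alpha_2


# Hypothetical population of six households; D holds units 0, 2 and 5.
z = [10.0, 40.0, 20.0, 70.0, 50.0, 30.0]
x = [1, 0, 1, 0, 0, 1]

mu = domain_mean_as_ratio(z, x)
direct_mean = sum(zk for zk, xk in zip(z, x) if xk == 1) / sum(x)
print(mu, direct_mean)
```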

Assume there are H strata, and h is used to index the strata. Assume
there are G groups, and g is used to index the groups, which are used to
construct the post-strata. The strata are known prior to sampling, but
the groups are not observed until after the final sample is taken.
Therefore, \(U_{gh}\) denotes all households in group g and stratum h;
\(N_{gh}\) denotes the size of \(U_{gh}\), and \(N_h\) denotes the size of
\(U_h\). Other variables are defined similarly in terms of g and h.

Let \(U_T\) denote the population of households in U which currently have
telephones, and let \(N_T\) denote the size of \(U_T\). The probability,
or propensity, that the kth household in U is also in \(U_T\) is denoted
by \(p_k\), and we assume that \(p_k > 0\) for all k. A simple random
sample of size \(n_h\) is taken from \(U_{Th}\) for h = 1, ..., H. Let
\(s_h\) denote this final sample in stratum h. The size of the final
sample, s, is denoted by n. For asymptotics herein, we assume that
\(n/N \to 0\) as \(n \to \infty\), in the same spirit as Särndal,
Swensson and Wretman (1992, pages 166-169).


3. THE ESTIMATORS 


The sampling design is treated as a two-phase design with Poisson
sampling at the first phase and stratified simple random sampling at the
second phase. Each individual enters the telephone population with
probability \(p_k\), for \(k \in U\), and then enters the final sample
according to a simple random sample of size \(n_h\), h = 1, ..., H. The
\(p_k\) are assumed known or can be estimated accurately, as shown in
section 5. The estimators of \(\mu\) discussed in this section will be
validated in the appendix.
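The two phases can be mimicked in a few lines. The following is a minimal sketch under stated assumptions, not the authors' code: the function name `draw_two_phase_sample`, the toy propensities, the stratum labels, and the seed are all hypothetical.

```python
import random


def draw_two_phase_sample(propensity, stratum, n_h, rng):
    """Simulate one realization of the two-phase design.

    propensity: dict unit -> p_k (probability of being in the phone population)
    stratum:    dict unit -> stratum label h
    n_h:        dict stratum label -> second-phase sample size
    Returns (phone_population, sample), each a dict stratum -> list of units.
    """
    # Phase 1: Poisson sampling, each unit enters independently with p_k.
    phone = {h: [] for h in n_h}
    for k, p_k in propensity.items():
        if rng.random() < p_k:
            phone[stratum[k]].append(k)

    # Phase 2: simple random sample without replacement within each stratum.
    sample = {}
    for h, units in phone.items():
        size = min(n_h[h], len(units))  # guard for very small toy populations
        sample[h] = rng.sample(units, size)
    return phone, sample


rng = random.Random(2002)  # arbitrary seed
propensity = {k: (0.95 if k % 2 == 0 else 0.80) for k in range(200)}
stratum = {k: ("A" if k < 100 else "B") for k in range(200)}
phone, sample = draw_two_phase_sample(propensity, stratum, {"A": 10, "B": 15}, rng)
print({h: len(u) for h, u in phone.items()})
print({h: len(s) for h, s in sample.items()})
```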


3.1 The Post-stratified and Ratio Estimators 


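Post-stratified estimates of \(\alpha_{(1)}\) and \(\alpha_{(2)}\) are

\[ \hat{\alpha}_{ps(i)} = N^{-1} \sum_{h=1}^{H} \sum_{g=1}^{G} N_{gh}\, n_{gh}^{-1} \sum_{k \in s_{gh}} y_{ik}, \]

for i = 1, 2, and the post-stratified estimate of \(\mu\) is
\(\hat{\mu}_{ps} = \hat{\alpha}_{ps(1)} / \hat{\alpha}_{ps(2)}\). A valid
estimate of the variance, conditional on \(U_T\), is known to be
(cf. Särndal et al. 1992, pages 270-271)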


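\[ \widehat{\mathrm{var}}\,\hat{\mu}_{ps} = [N \hat{\alpha}_{ps(2)}]^{-2} \sum_{h=1}^{H} \sum_{g=1}^{G} \frac{N_{gh}^{2}\,[1 - (n_h/N_{Th})]}{n_{gh}(n_{gh} - 1)} \sum_{k \in s_{gh}} \Big( y_{1k} - \hat{\mu}_{ps} y_{2k} - n_{gh}^{-1} \sum_{j \in s_{gh}} (y_{1j} - \hat{\mu}_{ps} y_{2j}) \Big)^{2}. \tag{3.1} \]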


Although the bias cannot be estimated from the final
sample, the theoretical bias of \(\hat{\mu}_{ps}\) is well-known to be


(3.2) 


as \(n \to \infty\). Noting (3.2), the MSE of \(\hat{\mu}_{ps}\) does not
go to zero in general as the sample size n gets large.

To determine the variance and bias of \(\hat{\alpha}_{ps(1)}\), set
\(y_{2k} = 1\) for all k, so that \(\hat{\mu}_{ps}\) and \(\mu\) become
\(\hat{\alpha}_{ps(1)}\) and \(\alpha_{(1)}\), respectively. One may then
apply (3.1) and (3.2) so that


— Be 1 -(n,/N,,) 
— nN-2 2 h'**Th 
var 6.1) = N »& De N 
h21 g=l No(Ngp 1) 
ai 
( y P| ope, PY -% O(n”) 


JEU,, eh 


as \(n \to \infty\). Cochran (1977, pages 134-135) provided a
correction factor, which is of order \(n^{-2}\), to (3.3). This
correction factor, however, is irrelevant to (3.1), since the
error term due to estimation from the ratio is \(O(n^{-2})\).

As usual, the ratio estimator, denoted by \(\bar{y}_1/\bar{y}_2\), is
defined to be the ratio of the sample mean of \(y_1\) to the sample mean
of \(y_2\). That is,

\[ \bar{y}_1/\bar{y}_2 = \sum_{k \in s} y_{1k} \Big/ \sum_{j \in s} y_{2j}. \]

The post-stratified and ratio estimators are identical when G = H = 1.
Since we will be using only one stratum (i.e., H = 1) in section 6, we
need not reference separate theory for the ratio estimator.
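The post-stratified and ratio estimators of this subsection can be written compactly. The sketch below uses hypothetical data and function names (`alpha_ps`, `mu_ps`, `ratio_estimator`); cells with no sampled households are simply skipped, which is a simplification rather than anything stated in the text.

```python
from collections import defaultdict


def alpha_ps(sample, which, N_gh, N):
    """N^-1 * sum over cells (g, h) of N_gh * (sample mean of y in the cell)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for unit in sample:
        cell = (unit["g"], unit["h"])
        totals[cell] += unit[which]
        counts[cell] += 1
    return sum(N_gh[c] * totals[c] / counts[c] for c in counts) / N


def mu_ps(sample, N_gh, N):
    """Post-stratified estimate of the ratio mu."""
    return alpha_ps(sample, "y1", N_gh, N) / alpha_ps(sample, "y2", N_gh, N)


def ratio_estimator(sample):
    """Ratio of the sample mean of y1 to the sample mean of y2."""
    return sum(u["y1"] for u in sample) / sum(u["y2"] for u in sample)


# Hypothetical sample: one stratum (h = 1), two groups, y2 = 1 throughout.
sample = [
    {"h": 1, "g": 1, "y1": 2.0, "y2": 1.0},
    {"h": 1, "g": 1, "y1": 4.0, "y2": 1.0},
    {"h": 1, "g": 2, "y1": 6.0, "y2": 1.0},
]
N_gh = {(1, 1): 60, (2, 1): 40}

post_stratified = mu_ps(sample, N_gh, N=100)

# Collapsing to a single group (G = H = 1) reproduces the ratio estimator.
one_group = [dict(u, g=1) for u in sample]
collapsed = mu_ps(one_group, {(1, 1): 100}, N=100)
print(post_stratified, collapsed, ratio_estimator(sample))
```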


3.2 The Phone-weighted Estimator 


Since the post-stratified estimator, \(\hat{\mu}_{ps}\), is biased, two
alternative estimators are suggested. One is the phone-weighted
estimator, which takes into account the probability that an individual
has a phone. In this section we assume that the \(p_k\) are known for all
\(k \in s\) or can be estimated accurately. Estimation of \(p_k\) using
the PUMS is discussed in section 5.

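For a crude estimate of \(\alpha_{(i)}\) for i = 1, 2, use

\[ \hat{\alpha}_{w(i)} = N^{-1} \sum_{h=1}^{H} N_{Th}\, n_h^{-1} \sum_{k \in s_h} p_k^{-1} y_{ik}. \tag{3.4} \]

Then, estimate \(\mu\) by

\[ \hat{\mu}_w = \hat{\alpha}_{w(1)} / \hat{\alpha}_{w(2)}, \tag{3.5} \]

which is asymptotically unbiased for \(\mu\), since \(\hat{\alpha}_{w(i)}\)
is unbiased for \(\alpha_{(i)}\), for i = 1, 2. A valid estimate of the
variance of \(\hat{\mu}_w\) is shown to be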
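
\[ \widehat{\mathrm{var}}\,\hat{\mu}_w = [N \hat{\alpha}_{w(2)}]^{-2} \sum_{h=1}^{H} \frac{N_{Th}^{2}\,[1 - (n_h/N_{Th})]}{n_h (n_h - 1)} \sum_{k \in s_h} \Big( \frac{y_{1k} - \hat{\mu}_w y_{2k}}{p_k} - n_h^{-1} \sum_{j \in s_h} \frac{y_{1j} - \hat{\mu}_w y_{2j}}{p_j} \Big)^{2}. \tag{3.6} \]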

Since the estimator \(\hat{\mu}_w\) is asymptotically unbiased, a valid
estimate of the MSE of \(\hat{\mu}_w\) is identical to the estimate of
the variance.

Setting \(y_{2k} = 1\) in (3.4) and (3.5) allows a valid estimate of
\(\alpha_{(1)}\) to be

\[ \Big[ \sum_{h=1}^{H} N_{Th}\, n_h^{-1} \sum_{k \in s_h} p_k^{-1} \Big]^{-1} \sum_{h=1}^{H} N_{Th}\, n_h^{-1} \sum_{k \in s_h} p_k^{-1} y_{1k}. \]

The variance of this estimate may be estimated by setting \(y_{2j} = 1\)
in (3.6).
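A direct transcription of (3.4) and (3.5) might look as follows. This is a sketch with hypothetical data; the function names `alpha_w` and `mu_w` and the dictionary layout of a sampled household are assumptions made for illustration.

```python
from collections import defaultdict


def alpha_w(sample, which, N_Th, N):
    """Equation (3.4): N^-1 * sum_h N_Th / n_h * sum_{k in s_h} y_k / p_k."""
    n_h = defaultdict(int)
    weighted_total = defaultdict(float)
    for unit in sample:
        n_h[unit["h"]] += 1
        weighted_total[unit["h"]] += unit[which] / unit["p"]
    return sum(N_Th[h] * weighted_total[h] / n_h[h] for h in n_h) / N


def mu_w(sample, N_Th, N):
    """Equation (3.5): ratio of the two phone-weighted estimates."""
    return alpha_w(sample, "y1", N_Th, N) / alpha_w(sample, "y2", N_Th, N)


# Hypothetical sample from one stratum; y2 = 1 makes mu a mean.
sample = [
    {"h": 1, "y1": 1.0, "y2": 1.0, "p": 0.5},
    {"h": 1, "y1": 0.0, "y2": 1.0, "p": 1.0},
    {"h": 1, "y1": 1.0, "y2": 1.0, "p": 1.0},
]
phone_weighted = mu_w(sample, N_Th={1: 90}, N=100)
unweighted = sum(u["y1"] for u in sample) / sum(u["y2"] for u in sample)
print(phone_weighted, unweighted)
```

The household with the small propensity counts twice as much as the others, so the weighted estimate (3/4) exceeds the unweighted one (2/3) in this toy sample.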


3.3. The Post-stratified Phone-weighted Estimator 


Another proposed estimator combines post-stratification with the
phone-weighted estimator, and is perhaps the best among the four, when
sample sizes are large enough to justify post-stratification. This new
estimator requires, however, that all \(N_{Tgh}\) be large enough so that
with high probability the \(n_{gh}\) are not too small. To estimate
\(\alpha_{(i)}\) we use

\[ \hat{\alpha}_{psw(i)} = N^{-1} \sum_{h=1}^{H} \sum_{g=1}^{G} N_{gh} \Big[ \sum_{j \in s_{gh}} p_j^{-1} \Big]^{-1} \sum_{k \in s_{gh}} p_k^{-1} y_{ik}, \]

for i = 1, 2. We then estimate \(\mu\) by
\(\hat{\mu}_{psw} = \hat{\alpha}_{psw(1)} / \hat{\alpha}_{psw(2)}\).
The estimate of the variance of \(\hat{\mu}_{psw}\) is
—2 HG 
—2 
var ee de NG, w ay) a Nal Ds Pj i 
h=1 g=l JESop 


rs =) i 
Pom | » P; | 


JES oh 


(3.7) 

If any of the \(n_{gh}\) terms are small, then one might instead
prefer the estimator


aS 

Notice that if \(N_{Tgh}\) were known, which is however unlikely, then a
more familiar and intuitive estimator of \(\mathrm{var}\,\hat{\mu}_{psw}\)
would be


=) Hd N. ie 
NO. vol > s 


Var fi aa ee 
“apse Pe 3 N,N, ~ 1) 


al ae 
k 
Nreh KESo, 


2 
a -1)-! -1 A 
Yi Poe“ | »: Pj | De Dx Vg pe BS 2m) . 
JESen MES eh (3.9) 


Since \(N_{Tgh}\) typically is unknown, (3.9) usually is not
a practical estimator. However, (3.9) helps motivate (3.7)
and (3.8), which are quite practical.

Since the estimator \(\hat{\mu}_{psw}\) is asymptotically unbiased, a
valid estimate of the MSE is identical to the estimate of the variance.
Further, setting \(y_{2j} = 1\) in (3.7) and (3.8) allows one to estimate
the variance of \(\hat{\alpha}_{psw(1)}\).

When G = 1, the estimator \(\hat{\mu}_{psw}\) does not reduce to
\(\hat{\mu}_w\) as one might naively anticipate. The preferred estimator
when G = 1 is \(\hat{\mu}_w\), since \(\hat{\mu}_w\) is based on only one
ratio, whereas \(\hat{\mu}_{psw}\) is based on a ratio of ratios. The
estimator \(\hat{\mu}_{psw}\) requires large sample sizes in each
stratum-group category, but \(\hat{\mu}_w\) requires only a large overall
sample size. When H = G = 1, however, the estimators \(\hat{\mu}_w\) and
\(\hat{\mu}_{psw}\) are identical; the variance estimators based on
\(\hat{\mu}_w\) are preferable to those based on \(\hat{\mu}_{psw}\),
because the estimates of the variance of \(\hat{\mu}_{psw}\) are based on
a ratio of ratios.
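The point estimator of this subsection, and the remark that it coincides with the phone-weighted estimator when H = G = 1, can be checked numerically. The sketch below uses hypothetical data; `alpha_psw` and `mu_psw` are illustrative names, and empty cells are skipped as a simplification.

```python
from collections import defaultdict


def alpha_psw(sample, which, N_gh, N):
    """N^-1 * sum over cells of N_gh * (sum y_k/p_k) / (sum 1/p_j)."""
    num = defaultdict(float)  # sum of y_k / p_k within each cell
    den = defaultdict(float)  # sum of 1 / p_j within each cell
    for unit in sample:
        cell = (unit["g"], unit["h"])
        num[cell] += unit[which] / unit["p"]
        den[cell] += 1.0 / unit["p"]
    return sum(N_gh[c] * num[c] / den[c] for c in den) / N


def mu_psw(sample, N_gh, N):
    """Post-stratified phone-weighted estimate of the ratio mu."""
    return alpha_psw(sample, "y1", N_gh, N) / alpha_psw(sample, "y2", N_gh, N)


# Hypothetical sample: one stratum, two groups, y2 = 1 throughout.
sample = [
    {"h": 1, "g": 1, "y1": 1.0, "y2": 1.0, "p": 0.5},
    {"h": 1, "g": 1, "y1": 0.0, "y2": 1.0, "p": 1.0},
    {"h": 1, "g": 2, "y1": 1.0, "y2": 1.0, "p": 1.0},
    {"h": 1, "g": 2, "y1": 1.0, "y2": 1.0, "p": 0.8},
]
two_groups = mu_psw(sample, {(1, 1): 60, (2, 1): 40}, N=100)

# With H = G = 1 the estimator collapses to sum(y1/p) / sum(y2/p).
one_group = [dict(u, g=1) for u in sample]
collapsed = mu_psw(one_group, {(1, 1): 100}, N=100)
phone_weighted = (sum(u["y1"] / u["p"] for u in sample)
                  / sum(u["y2"] / u["p"] for u in sample))
print(two_groups, collapsed, phone_weighted)
```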


4. ASYMPTOTIC MEAN SQUARED ERRORS 


The asymptotic mean squared errors of the estimators 
defined in section 3 now are stated. The proofs follow from 
Taylor linearization and are given in the appendix, along 
with the minor regularity conditions needed. 


4.1 The Post-stratified Estimator 


To find the asymptotic theoretical variance of the post-stratified
estimator of \(\mu\), we first define


G 
1 Net 


H 
* = . A on -1 
Oia plim, ... Ot 55(i) aN De 


h=1 g 


(4.1) 


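for i = 1, 2, and also define

\[ \mu^{*} = \alpha^{*}_{(1)} / \alpha^{*}_{(2)}. \tag{4.2} \]

Note that \(\alpha^{*}_{(i)} \neq \alpha_{(i)}\) and \(\mu^{*} \neq \mu\)
in general. The asymptotic theoretical variance of \(\hat{\mu}_{ps}\) is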


2 
y Pj 7H yy) 
JEU, 


Yip He Vou +O(n?+N7) 


JU sh (4.3) 


as \(n \to \infty\). The asymptotic bias of \(\hat{\mu}_{ps}\) was shown
in (3.2) to be O(1) as \(n \to \infty\). Therefore, the asymptotic MSE of
\(\hat{\mu}_{ps}\) is also O(1) as \(n \to \infty\).


4.2 The Phone-weighted Estimator 


The asymptotic theoretical variance of the phone-weighted
estimator of \(\mu\) is


ys (V1; 7H Y2;) fe 
yp, Sap Varn Jeu; +O(n2+N7). 


Dip; 


seu, (4.4) 


keU, P, 


Since \(\hat{\mu}_w\) is asymptotically unbiased, its MSE is the
same as the right hand side of (4.4).


4.3 The Post-stratified Phone-weighted Estimator 


The asymptotic theoretical variance of the post-stratified
phone-weighted estimator of \(\mu\) is


He G, 
var ee a (Wome ye pas 


h=1 g=l1 


ot p 
uP Na >» (¥,; HY) 
JEU, 


h 


+O(n?2+N7), (4.5) 
Since \(\hat{\mu}_{psw}\) is asymptotically unbiased, its MSE is the
same as the right hand side of (4.5).


5. ESTIMATING THE \(p_k\) USING PUBLIC
USE MICRODATA SAMPLES


The United States Bureau of the Census produced the 
Public Use Microdata Samples (PUMS), which include 1% 
and 5% samples of the population in each of the 50 states 
and Washington, D.C., for year 1990. For each person 
selected in the sample, 75 household variables and 75 
personal variables are listed, where each household has a 
clearly defined head of household. We utilize the PUMS for
two purposes. In this section we estimate the \(p_k\) using the PUMS,
whereas in section 6 we run simulations on the
PUMS to construct examples for comparing and contrasting
the estimators.

In this article, we use the 5% sample from Virginia. 
Since 5% represents a huge number of households, we treat 
this sample as if it were the entire population of Virginia. 
Since we are interested in telephone surveys, then from this 
5% sample we will sample households. Inferences may be 
made on personal variables, such as high school graduation 
rate, and household variables, such as the number of cars in 
a household or household income. Information pertaining 
to whether or not each household has a phone is included in 
the PUMS. We removed from our study all households 
whose telephone status is listed as “not applicable.” Such 
households were either vacant or were group quarters 
(institutions and non-institutions). The number of households
remaining in 1990 is 110,744, of which 104,606 have
phones; hence, the proportion of these households which
have phones is 94.5%.

Using generalized linear regression with a log-log link
on the 5% sample from Virginia along with the household
weights assigned in the PUMS, we estimate \(p_k\), which is
the probability, or propensity, that the kth household has a
phone. McCullagh and Nelder (1991, pages 107-110)
recommended the use of a log-log link when the
probabilities are close to one, and we found that this link
provided a good fit. We also found that the logit link
function provided a poor fit.

The PUMS household weights are used when estimating
the \(p_k\) but are not used elsewhere herein. In particular, in
section 6 when constructing Monte Carlo samples of the
PUMS population, the samples are simple random samples
from the telephone population.


Examples of estimating the \(p_k\)


Six covariates, the number of persons in the household, tenure (home
owner or renter), the date the head of household moved into the dwelling,
household income, household language, and race of the head of household,
are used to estimate the \(p_k\). These six covariates were chosen, along
with the categories for each covariate, based on a thorough analysis of
the 1990 PUMS using generalized linear regression techniques in SAS. All
of these covariates were found to be highly statistically significant.
Estimates of the \(p_k\) are made by summing the appropriate estimates of
the covariates in Table 1. The covariate for the number of persons should
be multiplied by the number of persons in the household; however, if the
number of persons exceeds five, then, for computations, convert this
number of persons to five. For example, if the household consists of
three English-speaking Asian Americans with two cars in a house purchased
in 1987, where the household income is $75,000, then Table 1 indicates
that the estimate of \(p_k\) is the solution to

\[ \log(-\log p_k) = 3 \times 0.2747 - 0.5552 + 0.5920 + 0.1896 + 1.0004 + 0.6156 + 0.0000. \]

Notice that in Table 1 within each of the covariates date moved in,
number of cars, and income, the values corresponding to the categories
are monotonically decreasing, as anticipated, except when income is
negative.
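The worked example can be evaluated numerically by inverting the log-log link, \(p_k = \exp(-\exp(\eta))\). The sketch below copies the 1990 coefficients from Table 1. The displayed sum lists only the covariate terms; whether the Table 1 intercept (-7.6707) is meant to be added is not stated there, and the sketch assumes that it is, since without it the implied propensity would be essentially zero. The function and variable names are hypothetical.

```python
import math

# 1990 coefficients from Table 1 for the categories in the worked example.
INTERCEPT = -7.6707
persons = min(3, 5)  # household size is capped at five
terms = {
    "number of persons": persons * 0.2747,
    "tenure: home owner": -0.5552,
    "date moved in: 1985 to 1988": 0.5920,
    "number of cars: 2": 0.1896,
    "income: $70,000 to $79,999": 1.0004,
    "language: English": 0.6156,
    "race: other": 0.0000,
}


def propensity_from_eta(eta):
    """Invert the log-log link log(-log p) = eta."""
    return math.exp(-math.exp(eta))


eta_covariates = sum(terms.values())   # the sum displayed in the text
eta = eta_covariates + INTERCEPT       # assumption: intercept is included
p_k = propensity_from_eta(eta)
print(round(eta_covariates, 4), round(eta, 4), round(p_k, 4))
```

Under that assumption the household's estimated propensity is a little over 0.99, which is plausible for a population in which 94.5% of households have phones.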


An adjustment which should be made when using
random digit dialing is to ask each respondent the number
of phone lines in the household, and multiply that number
by the estimate of \(p_k\) from Table 1 to obtain a new estimate
of \(p_k\). Consequently, \(p_k\) now is a weight, rather than a
probability. For the simulations discussed in section 6 this
adjustment is not necessary, since households are equally
likely to be selected using simple random sampling from the
PUMS, regardless of the number of phone lines.


Table 1
Values of covariates for estimating \(p_k\) using the Virginia 5% PUMS. Standard errors are in parentheses. If the number of persons
exceeds five, then convert this number to five. The covariate “date moved in” did not appear in the 1980 PUMS. The 1980 category “$40,000
to $49,999” actually includes “$40,000 or greater”. The “other” category for the 1980 covariate “language” includes Spanish.

Covariate           Category              1990 Value            1980 Value
number of persons                          0.2747 (0.0022)       0.1929 (0.0020)
tenure              home owner            -0.5552 (0.0079)      -0.7845 (0.0057)
                    renter                 0.0000 (0.0000)       0.0000 (0.0000)
date moved in       1989 or 1990           0.9742 (0.0121)       NA
                    1985 to 1988           0.5920 (0.0119)       NA
                    1980 to 1984           0.3489 (0.0138)       NA
                    1970 to 1979           0.2185 (0.0136)       NA
                    1969 or earlier        0.0000 (0.0000)       NA
number of cars      0                      1.2927 (0.0152)       0.8633 (0.0118)
                    1                      0.6842 (0.0143)       0.3981 (0.0109)
                    2                      0.1896 (0.0145)       0.0399 (0.0112)
                    3 or more              0.0000 (0.0000)       0.0000 (0.0000)
income              less than $0           [a 525] (0.1294)      2.3639 (0.0830)
                    $0 to $9,999           3.1929 (0.0539)       [DIS25'8] (0.0260)
                    $10,000 to $19,999     3.4878 (0.0538)       1.9763 (0.0258)
                    $20,000 to $29,999     3.0299 (0.0539)       1.0220 (0.0269)
                    $30,000 to $39,999     2.4297 (0.0543)       0.3889 (0.0317)
                    $40,000 to $49,999     1.8899 (0.0556)       0.0000 (0.0000)
                    $50,000 to $59,999     1.5992 (0.0578)       NA
                    $60,000 to $69,999     1.2144 (0.0631)       NA
                    $70,000 to $79,999     1.0004 (0.0704)       NA
                    $80,000 or greater     0.0000 (0.0000)       NA
language            English                0.6156 (0.0164)       0.4232 (0.0153)
                    Spanish                0.4889 (0.0216)       NA
                    other                  0.0000 (0.0000)       0.0000 (0.0000)
race                black                 -0.4233 (0.0064)      -0.3837 (0.0058)
                    other                  0.0000 (0.0000)       0.0000 (0.0000)
intercept                                 -7.6707 (0.0588)      -4.9024 (0.0322)

Table 1 thus can be used for estimating \(p_k\) when conducting
telephone surveys. When a generalized linear
regression model calculated from a PUMS of an earlier date
is used to analyse a later survey, rescaling should be
performed to take into account changes in the distribution
of household income across time. Table 1 also gives the
coefficients of a model calculated from the 1980 PUMS.
We discuss in section 6 an example when the 1980 PUMS
model is used to calculate \(p_k\) for a sample from the 1990
PUMS population. We note that the 1980 PUMS did not
include “date moved in” and that a better fitting model
arose when the language categories “Spanish” and “other”
were combined. In addition, median household income
almost doubled between 1980 and 1990, so fewer income
categories were used in 1980.

Although Table 1 is convenient when sampling from the
PUMS and performing simulations, the covariates listed in
Table 1 might not be available in actual surveys involving
random digit dialing. One may reproduce Table 1 using
different covariates, or one may estimate the \(p_k\) according
to the following alternative method.


An alternative method for estimating the \(p_k\)


The participants in a telephone survey based on random
digit dialing may be asked the following two questions: “(1)
How many telephone lines have been in your household
during the past twelve months? (2) During the past twelve
months, how many months was each telephone line in
service?” Now, let \(p_k\) be the sum of the answers to question
(2). For example, in a household with two phone lines,
where one of the lines was in service all twelve months and
the other was in service only five months, the estimate of \(p_k\)
would be 12 + 5 = 17. Again, \(p_k\) represents a weight rather
than a probability here. Asking the respondent this second
question is similar to an approach advocated by Brick et al.
(1994), who also suggested weighting the data to take into
account the probability that a household has phone service.
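The weight from question (2) is a simple sum, and with a single stratum only the relative sizes of the weights matter in a ratio of phone-weighted sums, because a common scale factor cancels. The sketch below illustrates both points with hypothetical respondents; `phone_weight` and `weighted_ratio` are illustrative names, not part of the original method description.

```python
def phone_weight(months_per_line):
    """Weight for one household: total months of service over all its lines."""
    for months in months_per_line:
        if not 0 <= months <= 12:
            raise ValueError("months in service must lie between 0 and 12")
    return sum(months_per_line)


def weighted_ratio(y1, y2, weights):
    """Single-stratum phone-weighted ratio: sum(y1 / w) / sum(y2 / w)."""
    num = sum(a / w for a, w in zip(y1, weights))
    den = sum(b / w for b, w in zip(y2, weights))
    return num / den


# The two-line household from the text: 12 months plus 5 months.
w = phone_weight([12, 5])
print(w)

# Hypothetical three-household sample; y2 = 1 turns the ratio into a mean.
weights = [phone_weight(m) for m in ([12], [12, 5], [6])]
y1 = [1.0, 0.0, 1.0]
y2 = [1.0, 1.0, 1.0]
estimate = weighted_ratio(y1, y2, weights)

# Dividing every weight by 12 leaves the estimate unchanged.
rescaled = weighted_ratio(y1, y2, [wk / 12 for wk in weights])
print(estimate, rescaled)
```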


6. INFERENCES ON HOUSEHOLD AND 
PERSONAL VARIABLES 


We will compare the four proposed estimators of \(\mu\) as we
make inferences on the high school graduation rate among
people at least 21-years-old, the mean number of cars per
household, and the mean household income, in the state of
Virginia. We performed 100,000 simulations of simple
random samples of 500 households with telephones from
the 1990 Virginia 5% PUMS using one stratum (i.e., H = 1).

In section 6.1, two sets of \(p_k\) are used. One is based upon
a GLIM regression fit to the 1990 PUMS, and the other is
based upon a GLIM fit to the 1980 PUMS with the income
categories inflated by the ratio of the 1990 median
household income ($32,800) to the 1980 median household
income ($17,510). Using the 1980 \(p_k\) to estimate a 1990
parameter demonstrates how well our method works when
GLIM coefficients are used for future data sets, provided
that an adjustment for inflation is made. Only the 1990 \(p_k\)
are used in section 6.2 and section 6.3.
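One way to read the inflation adjustment is that the lower bounds of the 1980 income categories are multiplied by the ratio of the medians before a 1990 income is classified. The text does not spell out the mechanics, so the sketch below is only one possible interpretation; the category bounds come from Table 1, and the function name `category_1980` is hypothetical.

```python
import bisect

MEDIAN_1990 = 32_800
MEDIAN_1980 = 17_510
RATIO = MEDIAN_1990 / MEDIAN_1980  # roughly 1.87

# 1980 income categories from Table 1 (the top one is "$40,000 or greater").
LABELS_1980 = [
    "less than $0",
    "$0 to $9,999",
    "$10,000 to $19,999",
    "$20,000 to $29,999",
    "$30,000 to $39,999",
    "$40,000 or greater",
]
LOWER_BOUNDS_1980 = [0, 10_000, 20_000, 30_000, 40_000]

# Assumed reading: inflate each 1980 boundary to 1990 dollars.
INFLATED_BOUNDS = [bound * RATIO for bound in LOWER_BOUNDS_1980]


def category_1980(income_1990):
    """Return the 1980 income category into which a 1990 income falls."""
    return LABELS_1980[bisect.bisect_right(INFLATED_BOUNDS, income_1990)]


for income in (-500, 15_000, 50_000, 75_000):
    print(income, "->", category_1980(income))
```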

Post-stratification should be used when the sample sizes
are sufficiently large. Non-post-stratified estimators may be
compared to each other, and post-stratified estimators may
be compared to each other. Comparing \(\bar{y}_1/\bar{y}_2\) to
\(\hat{\mu}_w\) is appropriate, and comparing \(\hat{\mu}_{ps}\) to
\(\hat{\mu}_{psw}\) is appropriate. These comparisons show the
improvements when using the \(p_k\) in the estimators.


6.1 Estimating the High School Graduation Rate 


Using the entire 1990 Virginia 5% PUMS, the mean
high school graduation rate among all Virginians at least
21-years-old is \(\mu = 0.75118\). When estimating the graduation
rate using a simple random sample and \(\hat{\mu}_{ps}\) or
\(\hat{\mu}_{psw}\), we post-stratify by gender (male, female), age
(less than 45 years old, at least 45 years old), and race (black, other)
of the head of household. The \(p_k\) are estimated using Table 1.
The values of the biases and standard deviations discussed
below are shown in Table 2, when 1990 \(p_k\) are used.


Table 2
Biases and Standard Deviations of Estimates of High School
Graduation Rate

                                Estimator
                                not post-stratified                              post-stratified
                                \(\bar{y}_1/\bar{y}_2\)   \(\hat{\mu}_{w}\)       \(\hat{\mu}_{ps}\)      \(\hat{\mu}_{psw}\)
aggregate bias                   0.01471                   0.00722                 0.01461                 0.00874
telephone bias                   0.01472                   0.00720                 0.01463                 0.00850
second phase bias                0.00000                   0.00002                -0.00002                 0.00024
theoretical bias                 0.00777                   0                       0.00663                 0
simulated standard deviation     0.01683                   0.01737                 0.01605                 0.01643
estimated standard deviation     0.01680                   0.01734                 0.01601                 0.01635*
theoretical standard deviation   0.01700                   0.01752                 0.01617                 0.01658
root mean squared error          0.02236                   0.01881                 0.02171                 0.01861

The true high school graduation rate is 0.75118. Post-stratification is
based on gender, age, and race. Samples of size 500 were taken and
100,000 simulations were performed.

* This value is based on (3.7), whereas the value based on (3.8) is
0.01610.

The aggregate biases of the four estimators of \(\mu\) are
estimated by the average over 100,000 simulations of the
difference between the estimate from a sample of size 500
and \(\mu\). These aggregate biases produced by \(\bar{y}_1/\bar{y}_2\),
\(\hat{\mu}_w\), \(\hat{\mu}_{ps}\), and \(\hat{\mu}_{psw}\) are estimated
to be 0.01471, 0.00722, 0.01461, and 0.00874, respectively, when 1990
\(p_k\) are used. Hence using the \(p_k\) reduces the bias of the
non-post-stratified estimator by 51%, and reduces the bias of the
post-stratified estimator by 40%.
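The two percentages follow directly from the aggregate biases quoted above. The short check below reproduces the arithmetic; the helper name `percent_reduction` and the dictionary keys are hypothetical labels.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


# Aggregate biases reported in Table 2 (1990 propensities).
bias = {"ratio": 0.01471, "w": 0.00722, "ps": 0.01461, "psw": 0.00874}

not_post_stratified = percent_reduction(bias["ratio"], bias["w"])
post_stratified = percent_reduction(bias["ps"], bias["psw"])
print(round(not_post_stratified), round(post_stratified))
```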

When the 1980 \(p_k\) are used, similar results arise. The
aggregate biases produced by \(\hat{\mu}_w\) and \(\hat{\mu}_{psw}\) are
estimated to be 0.00578 and 0.00856, respectively, when the 1980 \(p_k\)
are used. These results, however, are not summarized in the
tables.

The telephone bias, listed in Table 2, is the bias obtained
when the entire telephone population, \(U_T\), is sampled when
calculating \(\bar{y}_1/\bar{y}_2\), \(\hat{\mu}_w\), \(\hat{\mu}_{ps}\),
and \(\hat{\mu}_{psw}\). This bias is caused by
the fact that \(U_T\) is sampled rather than U. Throughout this
example, we use the convention of listing the estimates
based on the 1980 \(p_k\) in parentheses, when these estimates
differ from those based on the 1990 \(p_k\). The telephone
biases are 0.01472, 0.00720 (0.00577), 0.01463, and
0.00850 (0.00838), and are relatively close to the aggregate
biases.

The second phase bias is the difference between the
aggregate bias and the telephone bias, and is caused by the
fact that the estimator approximates a ratio. These second
phase biases, modulo rounding error, for \(\bar{y}_1/\bar{y}_2\),
\(\hat{\mu}_w\), \(\hat{\mu}_{ps}\), and \(\hat{\mu}_{psw}\) are estimated
to be 0.00000, 0.00002 (0.00001),
-0.00002, and 0.00024 (0.00018), respectively. Hence, the
second phase bias is trivial compared to the telephone bias
for this example.

The theoretical biases, based on (3.2), of \(\bar{y}_1/\bar{y}_2\) and
\(\hat{\mu}_{ps}\) are 0.00777 (0.00905) and 0.00663 (0.00678),
respectively. These biases differ from the aggregate biases, since (3.2)
is based on all possible phone populations, whereas the
aggregate biases are conditional on the one realization of
the phone population. The theoretical bias is based upon the
model that each household has a phone with probability \(p_k\),
and hence is dependent upon the model used to fit \(p_k\).
Since \(\hat{\mu}_w\) and \(\hat{\mu}_{psw}\) are asymptotically unbiased,
their theoretical biases are defined to be zero.

The simulated standard deviations of the 100,000 simulated
estimates of \(\mu\) for \(\bar{y}_1/\bar{y}_2\), \(\hat{\mu}_w\),
\(\hat{\mu}_{ps}\), and \(\hat{\mu}_{psw}\) are
0.01683, 0.01737 (0.01734), 0.01605, and 0.01643
(0.01634). These four numbers are fairly close to the estimated
standard deviations, which are the square root of the
average estimated variance of the estimator of \(\mu\), based on
(3.1), (3.6), and (3.7). Specifically, these estimated
standard deviations are 0.01680, 0.01734 (0.01732),
0.01601, and 0.01635 (0.01628), respectively. The estimated
alternative standard deviation, based on (3.8), of
\(\hat{\mu}_{psw}\) is 0.01610 (0.01606), which again is fairly close to
the value 0.01635 (0.01628). The theoretical standard deviations
are 0.01700 (0.01697), 0.01752 (0.01749), 0.01617
(0.01621), and 0.01658 (0.01653), based on the entire 1990
Virginia 5% PUMS and (4.3), (4.4), and (4.5). These
theoretical standard deviations also are close to the other
standard deviations calculated.


Using the \(p_k\) reduces the aggregate bias in the non-post-stratified
estimator by 51% (61%), and in the post-stratified
estimator by 40% (41%). The standard deviation, however,
increases slightly. Using the aggregate biases and the
simulated standard deviations, the root mean squared errors
of the estimators \(\bar{y}_1/\bar{y}_2\), \(\hat{\mu}_w\),
\(\hat{\mu}_{ps}\), and \(\hat{\mu}_{psw}\) are 0.02236
(0.02236), 0.01881 (0.01828), 0.02171 (0.02171), and
0.01861 (0.01844), respectively. Hence, using the \(p_k\)
reduces the MSE in the non-post-stratified estimator by
29% (33%), and reduces the MSE in the post-stratified
estimator by 27% (28%). Notice that there is little difference
between \(\bar{y}_1/\bar{y}_2\) and \(\hat{\mu}_{ps}\), and between
\(\hat{\mu}_w\) and \(\hat{\mu}_{psw}\), in
terms of MSE. Therefore, post-stratification offers little
improvement.
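The root mean squared errors combine the aggregate bias and the simulated standard deviation as the square root of bias squared plus variance, and the MSE reductions compare squared RMSEs. The check below recomputes them from the rounded Table 2 entries, so the results agree with the reported figures only up to rounding; the helper names are hypothetical.

```python
import math


def rmse(bias, sd):
    """Root mean squared error from a bias and a standard deviation."""
    return math.sqrt(bias ** 2 + sd ** 2)


def mse_reduction_percent(rmse_before, rmse_after):
    """Percentage reduction in MSE implied by two root mean squared errors."""
    return 100.0 * (1.0 - (rmse_after / rmse_before) ** 2)


# (aggregate bias, simulated standard deviation) from Table 2, 1990 propensities.
table2 = {
    "ratio": (0.01471, 0.01683),
    "w": (0.00722, 0.01737),
    "ps": (0.01461, 0.01605),
    "psw": (0.00874, 0.01643),
}
root_mse = {name: rmse(b, s) for name, (b, s) in table2.items()}

not_post_stratified = mse_reduction_percent(root_mse["ratio"], root_mse["w"])
post_stratified = mse_reduction_percent(root_mse["ps"], root_mse["psw"])

for name, value in root_mse.items():
    print(f"{name}: {value:.5f}")
print(f"MSE reductions: {not_post_stratified:.1f}% and {post_stratified:.1f}%")
```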


6.2 Estimating the Mean Number of Cars per 
Household 


The mean number of cars per household is 1.80397, as 
determined by the entire 1990 Virginia 5% PUMS. Post- 
stratification was based upon household income, using 
categories {less than $20,000, at least $20,000 but less than 
$35,000, and at least $35,000}. The \(p_k\) are again estimated,
but this time the covariate “number of cars” was excluded
from the GLIM fit to the 1990 PUMS, since mean number
of cars per household is what is being estimated.

As shown in Table 3, the estimates of the aggregate
biases using 100,000 simulations of simple random
samples of size 500 are 0.04872, 0.01629, 0.02226, and 0.01471, and
the telephone biases are 0.04872, 0.01623, 0.02220, and
0.01458, for estimators \(\bar{y}_1/\bar{y}_2\), \(\hat{\mu}_w\),
\(\hat{\mu}_{ps}\), and \(\hat{\mu}_{psw}\), respectively.
Therefore, the second phase biases are rather small.
Using the \(p_k\) reduces the bias from the non-post-stratified
estimator by 67%, and reduces the bias from the post-stratified
estimator by 34%. Perhaps the reason why this
latter amount of bias that can be removed is smaller than the
former is that income is a strong predictor of whether or not
a household has a phone (cf. Groves 1989, pages 116-119;
Thornberry and Massey 1988), and the post-stratification
groups for determining \(\hat{\mu}_{ps}\) and \(\hat{\mu}_{psw}\) are
based on income.

Table 3
Biases and Standard Deviations of Estimates of Mean
Number of Cars per Household

                                Estimator
                                not post-stratified                              post-stratified
                                \(\bar{y}_1/\bar{y}_2\)   \(\hat{\mu}_{w}\)       \(\hat{\mu}_{ps}\)      \(\hat{\mu}_{psw}\)
aggregate bias                   0.04872                   0.01629                 0.02226                 0.01471
telephone bias                   0.04872                   0.01623                 0.02220                 0.01458
second phase bias                0.00000                   0.00006                 0.00006                 0.00013
theoretical bias                 0.03388                   0                       0.00859                 0
simulated standard deviation     0.04694                   0.04764                 0.04162                 0.04172
estimated standard deviation     0.04682                   0.04753                 0.04148                 0.04158*
theoretical standard deviation   0.04715                   0.04791                 0.04152                 0.04161
root mean squared error          0.06765                   0.05035                 0.04720                 0.04424

The true mean number of cars per household is 1.80397. Post-stratification
is based on income. Samples of size 500 were taken and
100,000 simulations were performed.

* This value is based on (3.7), whereas the value based on (3.8) is
0.04142.

The standard deviations of the simulations are 0.04694,
0.04764, 0.04162, and 0.04172, respectively. The root
mean squared errors for the four estimators are approximately
0.06765, 0.05035, 0.04720, and 0.04424, respectively,
so using the \(p_k\) reduces the MSE by 45% and 12% for
non-post-stratification and post-stratification, respectively.

We also performed simulations, not summarized in the
tables, where “number of cars” was retained for the GLIM
fit to the 1990 PUMS. The aggregate biases for the
estimators \(\hat{\mu}_w\) and \(\hat{\mu}_{psw}\) are then 0.00116 and
0.00006, respectively, which are much smaller than 0.01629 and 0.01471,
the respective aggregate biases when “number of cars” was
removed from the GLIM fit. Nevertheless, we feel that
appropriate analysis requires removing the variable being
studied (i.e., number of cars) from the GLIM fit to the
PUMS.


6.3 Estimating the Mean Household Income 


The mean household income is $40,187, as determined
by the entire 1990 Virginia 5% PUMS. The \(p_k\) are again
estimated, but this time the covariate “income” was
excluded from the GLIM fit to the 1990 PUMS, since mean
household income is what is being estimated.

In Table 4, when estimating household income using a
simple random sample of size 500 and \(\hat{\mu}_{ps}\) or
\(\hat{\mu}_{psw}\), we post-stratified only by the race (black, other)
of the head of household. The estimates of the aggregate biases using
100,000 simulations are $1,412, $640, $1,192, and $633,
and the telephone biases are $1,414, $640, $1,193, and
$630, for estimators \(\bar{y}_1/\bar{y}_2\), \(\hat{\mu}_w\),
\(\hat{\mu}_{ps}\), and \(\hat{\mu}_{psw}\), respectively.
Thus, the second phase biases are small relative to the
telephone biases. Overall, using the \(p_k\) reduces the bias
from the non-post-stratified estimator by 55%, and reduces
the bias from the post-stratified estimator by 47%.

The standard deviations of the simulations are $1,534,
$1,518, $1,502, and $1,488, respectively. Hence the root
mean squared errors for the four estimators are approximately
$2,085, $1,647, $1,918, and $1,617, respectively, so
using the \(p_k\) reduces the MSE by 38% and 29% for
non-post-stratification and post-stratification, respectively. The
improvements from using post-stratification are smaller,
according to the MSE criterion.

In Table 5, we again are estimating household income,
but this time we post-stratify by gender (male, female), age
(less than 45 years old, at least 45 years old), and race
(black, other) of the head of household. Note that the
non-post-stratified estimators are not affected by this new
post-stratification. The estimates of the aggregate biases using
100,000 simulations are $1,173 and $757, and the
telephone biases are $1,177 and $747 for the post-stratified
estimators, \(\hat{\mu}_{ps}\) and \(\hat{\mu}_{psw}\), respectively.
Again, the second phase biases are small relative to the telephone biases.
Using the \(p_k\) reduces the bias from merely post-stratification
by 35%.

The theoretical bias for the post-stratified estimator is
$463. The standard deviations of the simulations are $1,445
and $1,435, for estimators \(\hat{\mu}_{ps}\) and \(\hat{\mu}_{psw}\),
respectively. The root mean squared errors are $1,861 and $1,622, for
estimators \(\hat{\mu}_{ps}\) and \(\hat{\mu}_{psw}\), respectively.
Hence, using the \(p_k\) reduces the MSE of the post-stratified estimator
by 24%.

The MSE of \(\hat{\mu}_{psw}\) is approximately the same in Table 4
and Table 5. However, the MSE of \(\hat{\mu}_{ps}\) decreases somewhat
from Table 4 to Table 5.


Table 4
Biases and Standard Deviations of Estimates of Household
Income, Post-stratified by Race

                                Estimator
                                not post-stratified                              post-stratified
                                \(\bar{y}_1/\bar{y}_2\)   \(\hat{\mu}_{w}\)       \(\hat{\mu}_{ps}\)      \(\hat{\mu}_{psw}\)
aggregate bias                   $1,412                    $640                    $1,192                  $633
telephone bias                   $1,414                    $640                    $1,193                  $630
second phase bias                -$2                       $0                      -$2                     $3
theoretical bias                 $789                      $0                      $586                    $0
simulated standard deviation     $1,534                    $1,518                  $1,502                  $1,488
estimated standard deviation     $1,53[?]                  $1,52[?]                $1,506                  $1,4[??]*
theoretical standard deviation   $1,535                    $1,518                  $1,503                  $1,488
root mean squared error          $2,085                    $1,647                  $1,918                  $1,617

The true mean household income is $40,187. Note that \(\bar{y}_1/\bar{y}_2\)
and \(\hat{\mu}_w\) are independent of post-stratification, so their
results are identical to those in Table 5. Samples of size 500 were taken
and 100,000 simulations were performed.

* This value is based on (3.7), whereas the value based on (3.8) is
$1,490.


Table 5
Biases and Standard Deviations of Estimates of Household
Income, Post-stratified by Gender, Age, and Race

                                  not post-stratified                 post-stratified
Estimator                         \(\bar y_1/\bar y_2\)   \(\hat\mu_w\)     \(\hat\mu_{ps}\)   \(\hat\mu_{psw}\)
aggregate bias                    $1,412        $640          $1,173        $757
telephone bias                    $1,414        $640          $1,177        $747
second phase bias                 -$2           $0            -$4           $10
theoretical bias                  $789          $0            $463          $0
simulated standard deviation      [illegible]   $1,518        $1,445        $1,435
estimated standard deviation      [illegible]   [illegible]   $1,448        $1,435*
theoretical standard deviation    $1,535        $1,518        $1,440        $1,430
root mean squared error           $2,085        $1,647        $1,861        $1,622

The true mean household income is $40,187. Note that \(\bar y_1/\bar y_2\) and \(\hat\mu_w\)
are independent of post-stratification, so their results are identical to
those in Table 4. Samples of size 500 were taken and 100,000
simulations were performed.

* This value is based on (3.7), whereas the value based on (3.8) is
$1,421.
7. DISCUSSION


We have proposed here to use publicly available large
data bases (e.g., the PUMS) to develop a model for the
propensity \(p_k\) of a household to have a telephone. We have
used, for Virginia in 1990, a GLIM model with a log-log
link and predictor variables number of persons, tenure, date
moved in, number of cars, household income, language, and
race.

We have proposed to use the telephone weights \(p_k\) to
reduce the bias of estimators due to noncoverage in telephone
surveys. This bias can be expected to occur when the
variable of interest is related to telephone ownership. The 
examples we have chosen are all variables of this type and 
hence the improvements using telephone weights are better 
than one would expect for variables with little relationship 
to telephone ownership. 
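
A minimal sketch of the idea follows. It assumes the weighted estimator is a ratio of sums in which each responding telephone household is weighted by the inverse of its modelled propensity; the function name and the toy figures are hypothetical and are not taken from the paper.

```python
def phone_weighted_mean(values, propensities):
    """Mean of `values` with weights 1/p_k (assumed form of the weighting)."""
    numerator = sum(y / p for y, p in zip(values, propensities))
    denominator = sum(1.0 / p for p in propensities)
    return numerator / denominator

# Hypothetical telephone sample: low-income households are assumed to be
# only half as likely to own a telephone as high-income households.
incomes = [20000, 20000, 60000, 60000, 60000, 60000]
propensities = [0.5, 0.5, 1.0, 1.0, 1.0, 1.0]

unweighted = sum(incomes) / len(incomes)
weighted = phone_weighted_mean(incomes, propensities)
print(unweighted, weighted)
```

In this toy case the unweighted mean overstates income because low-propensity households are under-represented among telephone owners; the inverse-propensity weights restore their share, which is the mechanism the telephone weights rely on.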

The weights can be combined with post-stratification. 
We have found that the use of such telephone weights 
greatly reduces the bias of both non-post-stratified and post- 
stratified estimators. 

Post-stratification requires a large enough sample size so
that each post-stratum has a negligible probability of being
empty. Our experiments dealt with samples of size 500,
and hence the number of post-strata was relatively limited.
Certainly, if one had a large enough sample so that one
could post-stratify on the same predictor variables as used
to develop the \(p_k\), the use of telephone weights should offer
negligible improvement over post-stratification. However,
many nationwide telephone opinion polls use
sample sizes of approximately 1,000, and we believe that for these sample
sizes, the use of telephone weights would offer a genuine
improvement.

We have also reported results from using telephone 
weights developed from the 1980 PUMS on 1990 data, with 
categories related to household income adjusted for 
inflation. The results are comparable to those for telephone 
weights developed from the 1990 PUMS. Therefore, 
although PUMS data are produced only every ten years and 
might be as much as twelve years out of date, substantial 
reductions in the biases of telephone sampling can be made 
using propensity models derived from older PUMS data 
sets, provided that the categories are suitably adjusted for 
inflation. 

Finally, the PUMS are divided by state and major metro- 
politan areas. This allows separate telephone-weighted 
models to be developed for major geographical units, and 
this would seem appropriate for large surveys. 
APPENDIX: DERIVATIONS OF EQUATIONS


Before deriving the equations in section 3 and section 4,
some regularity conditions must be assumed for sequences
\(\{\theta_{i1}, \theta_{i2}, \ldots\}\), for \(i = 1, 2\). Further, some lemmas must be
established. Then, the equations involving the estimators
\(\hat\mu_{ps}\), \(\hat\mu_w\) and \(\hat\mu_{psw}\) will be derived in the subsections below.
Whenever the error variable \(\epsilon_k\) is introduced below, then
\(\epsilon_k = O_p(1)\) and \(E(\epsilon_k)^2 = O(1)\) as \(k \to \infty\). For simplicity (but
slight abuse) of notation, the sequence \(\{\epsilon_1, \epsilon_2, \ldots\}\) will be
allowed to be different across different equations.


Condition A: Each a,, represents a sample mean of 
observations such that Ea, -a,=O(k~'), E|a,-a,|>= 


O(k~*”), and a, 0, 20k E18) a5 k= fori 1,2. Let 
M, = a,,/d,, for k =I, Pid 


LEMMA A.1 Condition A implies that \(E\mu_k - \mu = O(k^{-1})\)
as \(k \to \infty\).


PROOF: Define the function f(y,,y,) =y¥,/yY,. By a 
Taylor series linear expansion, 


H, ~H= O),/05, — 0/0, 
Of(a,,a,) Of (G) 0G) ele 
: Nilo Ueoa of 1) Saker eal Bh Gs 


SMO iy TG As Ne = (Cy 105) (Os ac akan: 


The result follows from Condition A. 


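Condition B: The sequence \(\{(\theta_{1k}, \theta_{2k})\}\)
satisfies

\[
k^{1/2}\begin{pmatrix}\theta_{1k}-\theta_1\\ \theta_{2k}-\theta_2\end{pmatrix}
\xrightarrow{d}
N\left(\begin{pmatrix}0\\ 0\end{pmatrix},
\begin{bmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2\\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{bmatrix}\right)
\]

as \(k \to \infty\), for some constants \(\sigma_1^2\), \(\sigma_2^2\), and \(\rho\).


LEMMA A.2 Under Conditions A and B,

\[
\mathrm{MSE}\,\mu_k = (\theta_2)^{-2}\,\mathrm{var}(\theta_{1k}-\mu\theta_{2k}) + O(k^{-3/2})
\]

and

\[
\mathrm{var}\,\mu_k = (\theta_2)^{-2}\,\mathrm{var}(\theta_{1k}-\mu\theta_{2k}) + O(k^{-3/2})
\]

as \(k \to \infty\).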




PROOF: By a Taylor series linear expansion, 
H, ~H = 4), /0,, — 4/0, 


Of(a,,a,) ae 


Of(a,,a,) 
-a ———_—— 
1k %) 30 


= (0) mL 
( 2k 2) da 


1 


a mae O° f(a.) RS 

2 (01, — %) a(a,)? 2k ~ %) 

0° f(a,,a,) (i, 0) 
~ +2 (0, — A) (A), — &) ae 
0(a.,) da, 0a, 

eke 


zs ee Ot) i (a,, 7 a) (a,) 
+ (05,0) (05) — (a, -0,) (0, -@,) (a) * 


+k asi S 


3 EO, ~H,) f . Gs (a, - a,) Ka once 
Therefore, 
(u, -p)’ + (a, “a5, -20,' (a, -ay)]+k ae 


which implies that 


MSEu, = ve Var (G},,- 10), ) = 20, 


COV (am ~H ase (a, as a,)} +k BE (A.1) 


Now we will show that the covariance term in (A.1) is 
asymptotically negligible. Since 


ke (Gy - Hoy) dN (0, 03) ia Pee 


2 
for some constant 63, then 


Dy y2 = 
KO — Wu Ty) d 03 X +k Me So 


where \(\chi_1^2\) denotes a chi-squared random variable with one
degree of freedom. Furthermore,


2 . 
KE GiGs = .05 eds Ne(Op0,)akungs: 
If the signs on a, are negated for i=1,2, then 


k(a,,-H,,)° does not change but k'’(a,,-a,) is 
negated. Therefore, by symmetry, 


COV( Mit tek (0 0,)) = Otke ) 


as k > ~, Hence, 


cov{ (a,, -H CO (Cre Oe) O(k*) (A.2) 
as \(k \to \infty\). Combining (A.1) and (A.2), the first part of the
lemma follows. Since Lemma A.1 implies that


\[\mathrm{bias}\,\mu_k = O(k^{-1})\]
as \(k \to \infty\), the second part of this lemma follows.


Condition C: Defining a,, = plim,_.@; given U,, the 


Tie oe 


estimator, d.., of 0; satisfies the following, for i = 1, 2: 


Brel U_ Jeter = O(naty; 


Given U7 G ata= O(n!) 
and 
E({ |é, OF |?) pe) OG so) 
as k- &, 
Condition D: Given U,, 


nea 

G., 
v((°). | f | nie : 
0 ae e 


for some positive definite matrix 2, where a. = plim,_. 4, 
given U,. Also, 


E({ 


n 1/2 


Ge tO = On 27) 
as n>, for i = 1, 2. 


THEOREM A.1 Under conditions C and D, we have that 
Var (ly = oe Evari@, =, 0(U,) + Ong +N) 
as n> o, where Ee Oy Oe. 


PROOF: First we determine E var (fi|U,).Under 
Condition D we apply Lemma A.2 to obtain 


var (fi|U;) = a7) var (0, -Hy 6,|U;)+n 76. (A.3) 
Since 
eee yee Nee, 
and 
War (Qn 0510, oa noe. 
then (A.3) implies that 
E var(fi|U,) = 


a; Evar(é,-p,4,|U,)+O(n +n '1N 1?) (AA) 


as n-. Now we determine var E({i| U,). Condition C 
and Lemma A.1 imply that 


E(a|U,) < parse =H Un NA SYS, - 
Hence,


var E(fi|U,) = O(n?+N") (A.5) 


as \(n \to \infty\). Combining (A.4) with (A.5), the result follows.


A.1 The post-stratified estimator 


Here we derive the equations related to the post-stratified
estimator, \(\hat\mu_{ps}\), where \(\hat\theta_{ps(1)}\) and \(\hat\theta_{ps(2)}\) satisfy Conditions C
and D. Note that


He" G 
E(6,.)|UD - Nee ye is ngalce ys Vir 


a 
iT) 

_ 
on 
iT} 

—_ 


for i= 1,2, and we define Lr os 
E(G,.1) [Us IEG, (| U7). Recall the definitions of a; 
and pw" in (4.1) and (4.2). 

Derivation of (4.3), the asymptotic variance of \(\hat\mu_{ps}\).
Since 


Var (G61) ~ Ur ps Gps al Ups Men) 


Hes Nv, Lhe 


ap >p> my 


h=1 g=l Non (Noon ~ 1) KE Urey 
=) Z 
Vig ~ Ur ps Yan ~ Nren es O71; - Ur ps¥2 j) 
JEU Tey, (A.6) 
then 
Evar (8,54) = [te G.<(2)| es Non) 
2 
HE Na (22) evs 
-N°y° > aoe 
h=1 g=1 nN, \e P; a P; ] 
JEU gy JEU y, 
De ani Oyen 03) 
= Pui Vix EB Vox 
ke gh > P; 
JEU gy, 
O(n? #7 Nt) (A.7) 
as n> ~, Also, since 
E(4@ 6a) ~My ps E502) | U7,N4, ) 
HG 
1 
=e yn Ds N, “Nin Ds ig Ur ps¥2x)> (A.8) 
h=1 g=l kE Urey 
then 
ElvarE (4...) — Ur ps 6.50) | U;,n,,)|Uz] = One A9) 


Since Theorem A.1 and (A.7) imply that 


Wales (GE var (G64) Ur ps bpsc2) | U.N, ») 


+ O(n? a Ne Ty (A.10) 


as n- , then (A.9) implies (4.3). 
Derivation of (3.1), the estimated variance of \(\hat\mu_{ps}\).
In light of (A.6) we have the estimator 


Ee Si 
ret je 22 {Yin “Mr ps Yad [Uz "a 
ES, 


licen Ui 


Non (Ne, i. 1) KES oy, 


=I 2 
Noh DS 01; UT ps Yy;) 


JESop 


oe © Pitins Yano 


Using (A.10) the result follows. 


Derivation of (3.2), the estimated bias of \(\hat\mu_{ps}\).
Lemma A.1 implies that 

ot Oa 
as n>», Since 

A ewe | 

EG. =O ta OUN a) 

as N-> o for i = 1,2, the result follows. 
A.2 The phone-weighted estimator 


Here we derive the equations related to the phone-weighted
estimator, \(\hat\mu_w\), where \(\hat\theta_{w(1)}\) and
\(\hat\theta_{w(2)}\) satisfy Conditions C and D. Note that
EQ @|Up) = r Nie Di Vin! Ph 
keU, 
for P=71, 2, and we define ape Me 


EX Gs |U;) /E(G,,5)) | Oo) 


Derivation of (4.4), the asymptotic variance of \(\hat\mu_w\).


Since 


var (a wi) Urry Gs U7) 


(Nn, - n,) Nn, 
h=1 1, (Np, ~ 1) 


=-N2 


then 
Evar (G,.a) “ry Gey U;)


A.11 
h=1 Nn, oe P; sae ( ) 
JEU, 
pes CP) F 
Vivaldi anites : 
os P, 7 
JEU, P, ys P; 
yeu, 
(A.12) 


+O (n = Na?) 


as n>, Applying Theorem A.1 to (A.11) the result 
follows. 


Derivation of (3.6), the estimated variance of \(\hat\mu_w\).


In light of Theorem A.1 a valid estimate of varfi, also 
estimates 


(5) val (6,4 — Br wb (2) | U;); 


which is equivalent to 


ACN a} 
(a) *E var Nips eS Mie 7 PT w 72k CRESTS : 
h=1 ny, Kes, Py 


The result follows. 


A.3 The post-stratified phone-weighted estimator 


Here we derive the equations related to the post-stratified
phone-weighted estimator, \(\hat\mu_{psw}\), where \(\hat\theta_{psw(1)}\)
and \(\hat\theta_{psw(2)}\) satisfy Conditions C and D. Lemma A.1 implies


H G 
keU. 
A s =i Tgh =i 
E (G47) San! DS 226 INE rei = 
h=1 g=1 Ss P, 
kE Urop 
for al and we define brews 


E(Go.(1) | UE (G,...(2)| U7): 
Derivation of (4.5), the asymptotic variance of \(\hat\mu_{psw}\).
Using Lemma A.2 it follows that 


=i! 
ye Pr {Vy x Ur pow Y2K3 


kes 
var e 
De Pe


ke Uren 


-] =] a 
Yr Hon ua [ Mo ye Pj | 
UT eh 


JE 


2, 
=| -2 
Nreh ps en Gn 


Yi; ~ Ur psw Ya | 
JE Tgh P; 


=i 
>» Pr {Vig ~ Mr psw Yad 


Therefore, 


(Nay, ~ 1,) = 4H) Oe 
AR 
1 Nren Nereh ~ ) 


Ms Pa 


ke Uren 


Zi = a, 


v} 
Yj Urpsw “s) AGATE 
JE UT ep P; (A.13) 
Since 


sagt ye Dj a ( ye Pi) Ne LO: 


JEU pp keU yy, 


as N ~ 0, then (A.13) implies the unconditional expectation 


= 
BD Px {Vix — Up psw Yad 


E var & z [Up Ng, 
Dy Pr 
KES op 
: 
Ne oe P (>, P, ee, 
JEU gy JEU, 


a -1 2 
DonP: yu -Boa Nn Ds (4, 7 HY) 


ke Ush JE Ugh 
+O (n?+N") (A.14) 
as n> ©. By Theorem A.1, 
var ee 
—2 me Pe 
o no) E var (Gosw() * UT osw Osw(2) | U,, Non? V8, h) 


=) A yy 
a a, E vat[ E (G1 i U T, psw 5 sw(2) | 


Uz, n,,,V8,h)|ny,,Vg.h] +O (n> +N") (a 15) 


gh? 
as \(n \to \infty\). Lemma A.1 implies that


oe | EG...) 7 UT now Osw(2) | 


Up, Mg,» V8) | Mey V8] = 27%. (4 16) 


Since (4.1) implies that 


A 


var (Got) 7 Ur nsw Fnsw(2) | U;, N) 
oe 
Lee 
Vee, 
h=1 
ai 
C » Pa LVie Ur psw 2d 
= E€ 
Var| Ny Se | : 
gi =| Tne, 
gol yD Px 


then (A.14), (A.15), and (A.16) imply the result. 


Derivation of (3.7), the estimated variance of \(\hat\mu_{psw}\).
Observe that 


Nreh be P; N..N N 2 
JEU), Tgh**Th gh 
=] 
Men Nonlh Deus 
kes 


Noting (4.5) the result follows. 


Derivation of (3.8), another estimated variance of \(\hat\mu_{psw}\).


the result follows from (3.7). 


REFERENCES 


BRICK, J.M., WAKSBERG, J. and KEETER, S. (1994). Evaluating 
the use of data on interruptions in telephone service for 
nontelephone households. Proceedings of the Survey Research 
Methods Section, American Statistical Association 19-28. 


COCHRAN, W.G. (1977). Sampling Techniques. New York: John 
Wiley & Sons, Inc. 


GROVES, R.M. (1989). Survey Errors and Survey Costs. New 
York: John Wiley & Sons, Inc. 


KEETER, S. (1995). Estimating telephone noncoverage bias with a 
telephone survey. Public Opinion Quarterly, 59, 196-217. 


KHURSHID, A., and SAHAI, H. (1995). A bibliography on 
telephone survey methodology. Journal of Official Statistics, 11, 
325-367. 


LITTLE, R.J.A. (1986). Survey nonresponse adjustments for 
estimates of means. International Statistical Review, 54, 139-157. 


McCULLAGH, P., and NELDER, J.A. (1991). Generalized Linear 
Models. New York: Chapman and Hall. 


RAO, J.N.K. (1997). Developments in sample survey theory: an 
appraisal. Canadian Journal of Statistics, 25, 1-21. 


RUBIN, D.B. (1987). Multiple Imputation for Nonresponse in 
Surveys. New York: John Wiley & Sons, Inc. 


SÄRNDAL, C.-E., SWENSSON, B. and WRETMAN, J. (1992).
Model Assisted Survey Sampling. New York: Springer-Verlag. 


STEEH, C.G., GROVES, R.M., COMMENT, R. and HANSMIRE, 
E. (1983). Report on the survey research center’s surveys of 
consumer attitudes. Incomplete Data in Sample Surveys, (Ed. 
W.G. Madow, H. Nisselson and I. Olkin), Academic Press, New 
York, 1. 


THORNBERRY, JR., O.T., and MASSEY, J.T. (1988). Trends in 
United States telephone coverage across time and subgroups. 
Telephone Surveys, (Eds. R.M. Groves, P.P. Biemer, L.E. Lyberg, 
J.T. Massey, W.L. Nicholls II, and J. Waksberg). New York: 
John Wiley & Sons, Inc., 25-49. 


Survey Methodology, June 2002 
Vol. 28, No. 1, pp. 77-85 
Statistics Canada 




Unbiased Estimation by Calibration on Distribution in Simple Sampling 
Designs Without Replacement 


YVES TILLÉ


ABSTRACT 


The post-stratified estimator sometimes has empty strata. To address this problem, we construct a post-stratified estimator 
with post-strata sizes set in the sample. The post-strata sizes are then random in the population. The next step is to construct 
a smoothed estimator by calculating a moving average of the post-stratified estimators. Using this technique it is possible 
to construct an exact theory of calibration on distribution. The estimator obtained is not only calibrated on distribution, it 
is linear and completely unbiased. We then compare the calibrated estimator with the regression estimator. Lastly, we 
propose an approximate variance estimator that we validate using simulations. 


KEY WORDS: Unbiased estimation; Calibration on a distribution function; Conditional inclusion probabilities; Weighting.


1. INTRODUCTION 


In a sample survey, the values of an auxiliary character are
sometimes known for all population units.
This information may be available when the units are
selected from a database containing other variables of interest.
The temptation is then to calibrate the results of a survey on
this auxiliary information. One option is to
retain only certain functions of this auxiliary variable
(moments, sizes) for the purpose of using a calibration
method (see for example Deville and Särndal 1992 or
Estevao, Hidiroglou and Särndal 1995); another is to
divide the variable into classes with a view to using a
post-stratified estimator.

If the decision is to opt for the post-stratified estimator,
deciding on the strata divisions can be delicate. Theoretically,
the strata must be defined prior to the selection of the
sample. Where should the post-strata boundaries be placed?
What size should the post-strata be? This latter question is
the most troublesome because the main problem with
post-stratification is the possibility of obtaining empty
post-strata. This means that the post-strata have to be large
enough so that the probability of obtaining an empty
post-stratum is negligible. These problems are not limited to
post-stratified estimators. Indeed, for some samples the
regression or calibrated estimators may not exist either.

Our objective is to define a new method of using auxi- 
liary information in the population. This method is based on 
the definition of post-strata for which the number of units 
is set in the sample and not in the population. In this way, it 
is possible to import into the estimator complex auxiliary 
information resulting from knowledge of all of the values 
taken by the auxiliary variable, while avoiding both the 
problem of defining post-strata borders and the problem of 
empty post-strata. 


This article is organized as follows. In section 2, the 
notation is defined and in section 3, we describe the 
principle of rank conditioning, which is used to define the 
unbiased estimators in section 4. In section 5, the smoothed 
estimator is defined, and a specific case is examined in 
detail in section 6. Section 7 contains an application of the 
estimation of a distribution function. In section 8, this new 
estimator is compared with the regression estimator and the 
estimator for a simple design without replacement. Compu- 
tation of variance is discussed in section 9. As a result of 
the impossibility of providing an exact solution, an approxi- 
mation is provided in section 10, which is tested by 
simulations in section 11. Lastly, general conclusions are 
presented in section 12. 


2. NOTATION 


We assume a population \(U\) composed of \(N\) observation
units, labelled \(\{1, \ldots, k, \ldots, N\}\).
In this population, we are interested in a character of
interest \(Y_k, k \in U\). The objective is to estimate the total
\(Y = \sum_{k \in U} Y_k\). We select a random sample \(S\) of fixed size \(n\)
by means of a simple random design without replacement.
We denote by \(I_k\) the random indicator variable, which takes
the value 1 if the unit \(k\) is in the sample and 0 if not. The
first-order inclusion probabilities are therefore defined by
\(\Pr(k \in S) = \pi_k = n/N\), \(k \in U\), and the second-order
inclusion probabilities by \(\Pr(k, l \in S) = \pi_{kl} = n(n-1)/(N(N-1))\),
\(k \ne l \in U\).

We will be interested in the class of linear estimators of
\(Y\), which is written as

\[\hat Y_w = \sum_{k \in S} w_k Y_k,\]

where the weights \(w_k\) may depend on the sample \(S\) and
therefore be random.

One of the possibilities is to take \(w_k = 1/\pi_k = N/n\), which
gives the Horvitz-Thompson estimator,

\[\hat Y_\pi = \sum_{k \in S} \frac{Y_k}{\pi_k} = \frac{N}{n}\sum_{k \in S} Y_k,\]

which is unbiased.
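
The unbiasedness of the Horvitz-Thompson estimator under simple random sampling without replacement can be checked by brute force on a small population. The sketch below enumerates every possible sample; the population values are hypothetical and exact rational arithmetic is used so that the comparison is exact.

```python
from fractions import Fraction
from itertools import combinations

def horvitz_thompson(sample_values, N):
    """Horvitz-Thompson estimator of the total under simple random sampling."""
    n = len(sample_values)
    return Fraction(N, n) * sum(sample_values)

population = [3, 8, 1, 12, 6]   # hypothetical values Y_k
N, n = len(population), 2

# All samples of size n are equally likely, so the expectation is the plain mean.
estimates = [horvitz_thompson(s, N) for s in combinations(population, n)]
expectation = sum(estimates) / len(estimates)
print(expectation, sum(population))
```

Averaging over the ten possible samples reproduces the population total exactly.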

We will be focussing instead on the more general class
of conditionally weighted estimators (Tillé 1998, 1999a)
where the units are weighted by inverses of conditional
inclusion probabilities. If \(Z\) is some statistic, then the
conditionally weighted estimator

\[\hat Y_{|Z} = \sum_{k \in S} \frac{Y_k}{E(I_k|Z)} \qquad (1)\]

is strictly unbiased if and only if \(E(I_k|Z) > 0\) for all \(k \in U\).
In effect,

\[E(\hat Y_{|Z} \mid Z) = \sum_{k \in U} \frac{E(I_k|Z)\,Y_k}{E(I_k|Z)} = Y.\]

Since the estimator is conditionally unbiased, it is also
unconditionally unbiased. Depending on which statistic \(Z\)
is used, estimator (1) generalizes the stratified estimator as
well as (to a close approximation) the regression estimator
(see Tillé 1998).


3. CONDITIONING ON RANKS 


Let us now assume that the \(N\) values \(X_1, \ldots, X_k, \ldots, X_N\) of
an auxiliary character \(x\) are known for the \(N\) units of the
population. First, we assume that all of the \(X_k\) take distinct
values (this hypothesis will be removed in section 5). The
rank \(R_k\) of unit \(k\) is

\[R_k = \#\{l \in U \mid X_l \le X_k\}.\]

Moreover, we denote by \(r_j\), \(j = 1, \ldots, n\), the ordered population
ranks of the \(n\) selected units in the sample, thus
\(r_1 < r_2 < \cdots < r_j < \cdots < r_n\). The \(r_j\) are random variables with a
negative hypergeometric distribution (see Tillé 1999b).
The statistic used to define the conditional probabilities
of inclusion is a subset of \(\{r_1, \ldots, r_j, \ldots, r_n\}\). First, we define

— an integer \(q\) such that \(2 \le q \le n\), defining the
period,

— an integer \(b\) such that \(2 \le b\), defining the border,

— an integer \(l\) such that \(b \le l \le b+q-1\), defining the
interval.

The quantities \(q\), \(b\), and \(l\) serve to define a subset of indices:

\[E_l = \{r_l, r_{l+q}, r_{l+2q}, \ldots, r_{l+hq}, \ldots, r_{l+Hq}\},\]

for \(l = b, \ldots, b+q-1\).


For example, if \(n = 18\), \(q = 4\), \(b = 3\), then

\(E_3 = \{r_3, r_7, r_{11}, r_{15}\}\),
\(E_4 = \{r_4, r_8, r_{12}, r_{16}\}\),
\(E_5 = \{r_5, r_9, r_{13}\}\),
\(E_6 = \{r_6, r_{10}, r_{14}\}\).


The conditional inclusion probability is computed in
relation to one of the \(E_l\).

The value of \(H\) is defined in such a way that
\(l + Hq \le n-b+1\) and thus \(H\) is the largest integer such
that \(H \le (n-b-l+1)/q\). It is clear that \(H\) depends on \(l\).
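
The construction of the index sets can be sketched in a few lines. The function below returns, for each admissible \(l\), the positions \(l, l+q, \ldots, l+Hq\) of the pivot ranks, with \(H\) taken as the largest integer satisfying \(l + Hq \le n-b+1\); the function name is illustrative only.

```python
def index_sets(n, q, b):
    """Positions j of the ranks r_j forming E_l, for l = b, ..., b+q-1."""
    sets = {}
    for l in range(b, b + q):
        H = (n - b - l + 1) // q          # largest H with l + H*q <= n - b + 1
        sets[l] = [l + h * q for h in range(H + 1)]
    return sets

# The case n = 18, q = 4, b = 3 discussed in the text.
example_sets = index_sets(n=18, q=4, b=3)
for l, positions in example_sets.items():
    print(l, positions)
```

For \(n = 18\), \(q = 4\), \(b = 3\) this lists the positions 3, 7, 11, 15 for \(E_3\), and similarly for \(E_4\), \(E_5\) and \(E_6\).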

The next step is to compute the inclusion probabilities:

\[
E(I_k \mid E_l) =
\begin{cases}
1 & \text{if } k \in E_l,\\[4pt]
\dfrac{l-1}{r_l-1} & \text{if } k < r_l,\\[8pt]
\dfrac{q-1}{r_{l+hq}-r_{l+(h-1)q}-1} & \text{if } r_{l+(h-1)q} < k < r_{l+hq},\ h = 1, \ldots, H,\\[8pt]
\dfrac{n-(l+Hq)}{N-r_{l+Hq}} & \text{if } k > r_{l+Hq}.
\end{cases}
\]

These inclusion probabilities are thus relatively uneven.
However, they are all positive, including at the borders. It is
important to use a border \(b \ge 2\) so that the first and the last
post-stratum are not empty.


4. CLASS OF UNBIASED ESTIMATORS 


Since \(E(I_k|E_l) > 0\), we can construct an estimator that is
unbiased and even conditionally unbiased with respect to
\(E_l\). By denoting \(y_1, \ldots, y_j, \ldots, y_n\) the \(n\) values taken by the
units in the sample ordered according to the \(R_k\), we obtain

\[
\hat Y_l = \sum_{k \in S} \frac{Y_k}{E(I_k|E_l)}
= \frac{r_l-1}{l-1}\sum_{j=1}^{l-1} y_j + y_l
+ \sum_{h=1}^{H}\left(\frac{r_{l+hq}-r_{l+(h-1)q}-1}{q-1}\sum_{j=1}^{q-1} y_{l+(h-1)q+j} + y_{l+hq}\right)
+ \frac{N-r_{l+Hq}}{n-(l+Hq)}\sum_{j=l+Hq+1}^{n} y_j
\]

\[
= N_{0|l}\,\hat{\bar y}_{0|l} + y_l + \sum_{h=1}^{H}\left(N_{h|l}\,\hat{\bar y}_{h|l} + y_{l+hq}\right) + N_{H+1|l}\,\hat{\bar y}_{H+1|l},
\]

where

\[N_{0|l} = r_l - 1,\]
\[N_{h|l} = r_{l+hq} - r_{l+(h-1)q} - 1, \quad h = 1, \ldots, H,\]
\[N_{H+1|l} = N - r_{l+Hq},\]
\[\hat{\bar y}_{0|l} = \frac{1}{l-1}\sum_{j=1}^{l-1} y_j,\]
\[\hat{\bar y}_{h|l} = \frac{1}{q-1}\sum_{j=1}^{q-1} y_{l+(h-1)q+j}, \quad h = 1, \ldots, H,\]

and

\[\hat{\bar y}_{H+1|l} = \frac{1}{n-(l+Hq)}\sum_{j=l+Hq+1}^{n} y_j.\]

This estimator is in reality a post-stratified estimator
where the sizes of the post-strata are set in the sample.
Since \(E(I_k|E_l) > 0\), \(\hat Y_l\) is strictly unbiased unconditionally
and conditionally to \(E_l\), which is clearly not the case for the
traditional post-stratified estimator, because the latter has a
non-zero probability of having an empty post-stratum. By
setting the size of the post-strata in the sample, creating
empty post-strata becomes impossible. The corresponding
size of the post-stratum in the population is a random
variable \(N_{h|l}\).

The estimator \(\hat Y_l\) has another interesting property. By
using the definition of the \(E(I_k|E_l)\), we can quite easily
show that

\[\sum_{k \in S} \frac{1}{E(I_k|E_l)} = N.\]

The estimator is thus calibrated on the size of the
population. This property, which is also held by the
Horvitz-Thompson estimator in simple designs, is therefore
not lost. Units where the ranks are in \(E_l\) are called pivot
units, and are assigned a weight equal to 1, which makes the
weights very unequal. A downside to \(\hat Y_l\) is the use of widely
dispersed weights. This problem can be resolved by
smoothing the estimators.


5. SMOOTHING ESTIMATORS 


To resolve the problem of the dispersion of the weights,
we compute a moving average for the estimators as follows:

\[\hat Y_C = \frac{1}{q}\sum_{l=b}^{b+q-1} \hat Y_l.\]

\(\hat Y_C\) retains all of the properties of the \(\hat Y_l\). This means that it
is unbiased, calibrated on \(N\) and linear and can therefore be
written as

\[\hat Y_C = \sum_{j=1}^{n} w_j y_j,\]


where Ww, 
b+q-1 py —] 
1 l J<b, 
a | 
b+q-1 
es > _Tj+t-b "m-(jrl-b-q) S| Nine (j+l-b-q) Be +] b<j<b+q-1 
q\ 1b j+l-b-m~ (j+l-b-q)-1 


apap | yj Rip =D T A =O. 


ADs 2= GSN Sipaly ll. 


{5p ania i), 
q q-l 


+q-1 
SS ee  j+l-b-q fatter) 


I=b * (j+l-b)- j+I-b-q-1 


1 N+l-r.,_-l 


l = 
a fe Tila (@aeil=|)j=il 
SS N-r 


n+\-l 


; aD bi 


Ly 
Qty teal 


0 rhe getede: 
m (x)= j : 
39 if not 


if x>n-b+1 


x if not (2) 


is Stik 
Despite the apparent complexity arising from the specific
treatment of the borders, the weighting system is relatively
simple. For units that are not too close to the
borders, the weight takes the value


\[
w_j = \frac{1}{q}\left(1 + \sum_{l=b+1}^{b+q-1} \frac{r_{j+l-b} - r_{j+l-b-q} - 1}{q-1}\right)
= \frac{1}{q(q-1)}\sum_{\alpha=1}^{q-1}\left(r_{j+\alpha} - r_{j-\alpha}\right).
\]


If the values of the auxiliary variable are not all
distinct, we can assign the ranks of units with common values
randomly. For example, if \(X_1 = 2\), \(X_2 = 5\), \(X_3 = 5\),
\(X_4 = 7\), \(X_5 = 8\), we select with a probability 1/2 between the
ranks \(R_1 = 1, R_2 = 2, R_3 = 3, R_4 = 4, R_5 = 5\), or \(R_1 = 1, R_2 = 3,
R_3 = 2, R_4 = 4, R_5 = 5\). We then compute the smoothed
estimator for each permutation, and we calculate their
mean. The advantage of this method is that it preserves an
unbiased estimator. In effect, for each possible permutation,
the estimator is unbiased. In practice, it is not necessary to
compute estimators for all of the permutations. We can
calculate the estimator for one permutation and then simply
equalize the weights of the units having the same values for
the variable \(x\).
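
The practical shortcut can be sketched as follows. The sketch assumes that "equalizing" means replacing the weights of units sharing the same \(x\) value by their common average, which leaves the sum of the weights unchanged; this reading, the function name and the numbers are assumptions made for illustration only.

```python
from collections import defaultdict

def equalize_tied_weights(x_values, weights):
    """Give units with identical x the average of their weights (assumed rule)."""
    groups = defaultdict(list)
    for index, x in enumerate(x_values):
        groups[x].append(index)
    equalized = list(weights)
    for indices in groups.values():
        mean_weight = sum(weights[i] for i in indices) / len(indices)
        for i in indices:
            equalized[i] = mean_weight
    return equalized

# Hypothetical sample in which the second and third units share x = 5.
x_sample = [2, 5, 5, 7]
raw_weights = [3.0, 2.0, 4.0, 1.0]
tied_weights = equalize_tied_weights(x_sample, raw_weights)
print(tied_weights)
```

Because only weights within a tied group are averaged, calibration on the population size is not affected.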


6. CASE WHERE g =2,b=2 


When \(q = 2\) and \(b = 2\), we obtain after a few calculations


Y= LS eam


. eee s cata 
2 ph Si 
- l 3 2 - _ 

e ey righ Ye ie Ty+l Tel Y=? ay 
Bs 2 

- Ly met 
Piet 2 
+ 1 5 + V5 > 
+y Fatt ae eae -2r,, +y Us Cae, 
n-1 a n y) z 
where \(r_0 = 0\) and \(r_{n+1} = N+1\). This brings us to an
estimator proposed by Ren (2000, page 140) and obtained
using a calibration argument. The way in which the borders
are managed is the only slight difference.


Example 1: Consider a population of size \(N = 20\). Let us
assume that the values of the variable of interest are found
in Table 1. We also assume that the sample of size \(n = 7\) is
composed of the units with ranks \(\{3, 7, 8, 11, 12, 15, 17\}\).

If we take \(q = 2\), \(l = 2\), \(b = 2\), we obtain \(E_2 = \{r_2, r_4, r_6\} =
\{7, 11, 15\}\). We can then calculate \(E(I_k \mid E_2 = \{7, 11, 15\})\).
The conditional inclusion probabilities are as follows:

\(E(I_3 \mid E_2 = \{7, 11, 15\}) = 1/6\),
\(E(I_7 \mid E_2 = \{7, 11, 15\}) = 1\),
\(E(I_8 \mid E_2 = \{7, 11, 15\}) = 1/3\),
\(E(I_{11} \mid E_2 = \{7, 11, 15\}) = 1\),
\(E(I_{12} \mid E_2 = \{7, 11, 15\}) = 1/3\),
\(E(I_{15} \mid E_2 = \{7, 11, 15\}) = 1\),
\(E(I_{17} \mid E_2 = \{7, 11, 15\}) = 1/5\).


Table 1 


Example of a Population of Size N = 20 


Kee 182935 ANS) ONT V819) | 10 11892) 131401 5116s E Tee ONO 
xX, 9 71 72 35.91 143 36 64 38 81 52 78 62 86 16 20 59 84 55 


R, 2 14. 15,56" 20 3 SPLIT 8) 17 SIGN A194 Ween 


The estimator

\[\hat Y_2 = \sum_{k \in S} \frac{Y_k}{E(I_k \mid E_2 = \{7, 11, 15\})}\]

is therefore unbiased and conditionally unbiased. Further,
it is linear and

\[\sum_{k \in S} \frac{1}{E(I_k \mid E_2 = \{7, 11, 15\})} = N.\]


However, if we take \(q = 2\), \(l = 3\), \(b = 2\), we obtain
\(E_3 = \{r_3, r_5\} = \{8, 12\}\). Using the same example, we then
compute \(E(I_k \mid E_3 = \{8, 12\})\), and we obtain

\(E(I_3 \mid E_3 = \{8, 12\}) = 2/7\),
\(E(I_7 \mid E_3 = \{8, 12\}) = 2/7\),
\(E(I_8 \mid E_3 = \{8, 12\}) = 1\),
\(E(I_{11} \mid E_3 = \{8, 12\}) = 1/3\),
\(E(I_{12} \mid E_3 = \{8, 12\}) = 1\),
\(E(I_{15} \mid E_3 = \{8, 12\}) = 2/8 = 1/4\),
\(E(I_{17} \mid E_3 = \{8, 12\}) = 2/8 = 1/4\).

The estimator

\[\hat Y_3 = \sum_{k \in S} \frac{Y_k}{E(I_k \mid E_3 = \{8, 12\})}\]

is also unbiased and linear.

Lastly, we compute the mean of the two estimators:

\[\hat Y_C = \frac{\hat Y_2 + \hat Y_3}{2}.\]
The weights are obtained simply by calculating the mean of
the weights of estimators \(\hat Y_2\) and \(\hat Y_3\) and have the values

\(w_3 = (6 + 7/2)/2 = 19/4\),
\(w_7 = (1 + 7/2)/2 = 9/4\),
\(w_8 = (3 + 1)/2 = 2\),
\(w_{11} = (1 + 3)/2 = 2\),
\(w_{12} = (3 + 1)/2 = 2\),
\(w_{15} = (1 + 4)/2 = 5/2\),
\(w_{17} = (5 + 4)/2 = 9/2\).

Estimator \(\hat Y_C\) is linear and strictly unbiased.
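
The calculations of Example 1 can be reproduced with a short sketch. It applies the conditional inclusion probabilities of section 3 for each \(l\), inverts them to obtain weights, and averages over \(l = b, \ldots, b+q-1\). Function and variable names are illustrative, and exact fractions are used so that the results can be compared directly with the values above.

```python
from fractions import Fraction

def conditional_weights(ranks, N, q, l, b):
    """Weights 1 / E(I_k | E_l) for the sample units, in increasing rank order."""
    r = [None] + sorted(ranks)            # r[j] = population rank of j-th sample unit
    n = len(ranks)
    H = (n - b - l + 1) // q              # largest H with l + H*q <= n - b + 1
    pivots = {l + h * q for h in range(H + 1)}
    last = l + H * q
    weights = []
    for j in range(1, n + 1):
        if j in pivots:                   # pivot unit: probability 1
            weights.append(Fraction(1))
        elif j < l:                       # below the first pivot
            weights.append(Fraction(r[l] - 1, l - 1))
        elif j > last:                    # above the last pivot
            weights.append(Fraction(N - r[last], n - last))
        else:                             # between two consecutive pivots
            h = (j - l) // q + 1
            weights.append(Fraction(r[l + h * q] - r[l + (h - 1) * q] - 1, q - 1))
    return weights

def smoothed_weights(ranks, N, q, b):
    """Average of the conditional weights over l = b, ..., b+q-1."""
    per_l = [conditional_weights(ranks, N, q, l, b) for l in range(b, b + q)]
    return [sum(column) / q for column in zip(*per_l)]

ranks = [3, 7, 8, 11, 12, 15, 17]
w2 = conditional_weights(ranks, N=20, q=2, l=2, b=2)
w3 = conditional_weights(ranks, N=20, q=2, l=3, b=2)
w = smoothed_weights(ranks, N=20, q=2, b=2)
print([str(x) for x in w], sum(w))
```

Each of the three weight vectors sums to \(N = 20\), which illustrates the calibration on the population size.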


7. APPLICATION TO THE ESTIMATION 
OF THE DISTRIBUTION 


There are several ways to appropriately use auxiliary
information to estimate a distribution function. A description
of these techniques can be found in Ren (2000) and in
Wu and Sitter (2001). The method that we suggest also
makes it possible to estimate the distribution. The
distribution in the population is defined by

\[F(y) = \frac{1}{N}\sum_{k \in U} 1\{y_k \le y\},\]

and can be estimated by

\[\hat F(y) = \frac{\sum_{k \in S} w_k\, 1\{y_k \le y\}}{N},\]

where \(1\{y_k \le y\}\) is the indicator function, and the \(w_k\) are
the weights allocated to the units \(k\), which have the value
\(N/n\) for the Horvitz-Thompson estimator, and
which are given in (2) for the calibrated estimator.
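
A weighted distribution estimator of this kind can be sketched as below. The sketch divides by the sum of the weights, which equals \(N\) whenever the weights are calibrated on the population size, as is the case for both weighting systems considered here; the sample values and weights are hypothetical.

```python
from fractions import Fraction

def weighted_cdf(values, weights, y):
    """Share of the total weight carried by sample units with value <= y."""
    total = sum(weights)
    below = sum(w for v, w in zip(values, weights) if v <= y)
    return Fraction(below) / Fraction(total)

# Hypothetical sample of four units whose weights sum to N = 10.
sample_values = [10, 20, 30, 40]
sample_weights = [4, 2, 2, 2]

cdf_at_25 = weighted_cdf(sample_values, sample_weights, 25)
cdf_at_max = weighted_cdf(sample_values, sample_weights, 40)
print(cdf_at_25, cdf_at_max)
```

The estimate is a step function that jumps by \(w_k / \sum w_k\) at each sampled value and reaches 1 at the largest one.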

Note that the two functions are discrete, but that there are
far fewer jumps in \(S\) than in \(U\). To offset the differences in
the distributions between the sample and the population, we
have smoothed the distribution functions by using, as
Deville (1995) did, a linear interpolation of the centres of
the risers, which involves defining \(F_1(y)\) by linking the
points

\[\left\{y_k,\ \tfrac{1}{2}\left[F(y_k) + F(y_k - \epsilon)\right]\right\},\]

for \(k \in U\), where \(\epsilon\) is a strictly positive, arbitrarily small real
number. We then define \(\hat F_1(y)\) by linking the points

\[\left\{y_k,\ \tfrac{1}{2}\left[\hat F(y_k) + \hat F(y_k - \epsilon)\right]\right\},\]

for the sample.
Example 2: A population of size N = 1,000 was generated
using independent and identically distributed log-normal
variables. A sample of size n = 16 was then selected and
we set h = 5. Figure 1 gives F,(x) in the population.


Figure 1. Population distribution function


Figure 2 shows F(x) and the distribution estimated by the 
Horvitz-Thompson estimator. Lastly, Figure 3 shows F’, (x) 
and the distribution estimated by the calibrated estimator. 


Figure 2. Population distribution function and Horvitz-Thompson
distribution estimator


Figure 3. Population distribution function and calibrated distribution
estimator
8. COMPARISON WITH THE REGRESSION
ESTIMATOR 


In order to compare the qualities of the proposed 
estimator, a series of simulations was conducted to compare 
the estimator calibrated on distribution with the Horvitz- 
Thompson estimator and the regression estimator. Three 
populations of size 1,000 were generated by means of the 
following models. 


— Model A Linear dependence: The population is 
generated using the model \(X_k \sim N(0, 1)\) and
\(Y_k = X_k + 1.33333 \times \epsilon_k\), where \(\epsilon_k \sim N(0, 1)\). The
coefficient of correlation obtained in the
population is \(\rho = 0.616154\).


— Model B Non-linear dependence 1: The population 
is generated using the model X,~N(O, 1) and 
Y= +(O.2ar Xt )-0t 110388338 x¥e! where 
€,~N(0,1). The coefficient of correlation 
obtained in the population is p = 0.28975. 


—- Model C Non-linear dependence 2: The population 
is generated using the model X,~N(O, 1) and 
Van (OA UNE) lo 333docc. where 
€,~N(0,1). The coefficient of correlation 
obtained in the population is p = 0.476158. 


In each population, 100,000 samples of size 100 were 
selected. Three weighting systems were computed for each 
sample. 


1. the weights associated with the simple design
\(w_k = N/n\),


2. the weights of the distribution calibrated estimator 
given in (2) using the window q = 10 and border 
b =6, 


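3. the weights of the regression estimator given by

\[w_k = \frac{N}{n} + (X - \hat X_\pi)\,\frac{X_k - \hat{\bar X}}{\sum_{l \in S}(X_l - \hat{\bar X})^2},\]

where \(X\) is the total of the \(X_k\) in the population,
\(\hat X_\pi\) is the Horvitz-Thompson estimator of \(X\), and

\(\hat{\bar X} = \hat X_\pi / N\).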
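
The third weighting system can be sketched as follows, under the assumption that the regression weights take the standard form for simple random sampling, \(w_k = N/n + (X - \hat X_\pi)(X_k - \hat{\bar X}) / \sum_{l \in S}(X_l - \hat{\bar X})^2\). The function name, the seed and the simulated population are illustrative choices, not the author's code.

```python
import random

def regression_weights(x_sample, x_total, N):
    """Regression-estimator weights under simple random sampling (assumed form)."""
    n = len(x_sample)
    x_ht = N / n * sum(x_sample)          # Horvitz-Thompson estimator of the total X
    x_bar = x_ht / N                      # estimated mean of x
    ss = sum((x - x_bar) ** 2 for x in x_sample)
    return [N / n + (x_total - x_ht) * (x - x_bar) / ss for x in x_sample]

random.seed(1)
N, n = 1000, 100
x_pop = [random.gauss(0, 1) for _ in range(N)]    # X_k ~ N(0, 1), as in the models above
sample_idx = random.sample(range(N), n)           # simple random sample without replacement
x_s = [x_pop[i] for i in sample_idx]

w_reg = regression_weights(x_s, sum(x_pop), N)
print(sum(w_reg), sum(wk * xk for wk, xk in zip(w_reg, x_s)), sum(x_pop))
```

With this form the weights sum to \(N\) and reproduce the known total \(X\), which is the sense in which the regression estimator is calibrated on the auxiliary variable.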


Using these weights, the estimators of the mean and of the
nine deciles were calculated for each sample. We then
estimated the variance of these estimators by means of the
simulations.

The results are given in Tables 2, 3 and 4. The variances
are expressed relative to that of the simple design, which is set to 1.
For the linear model, the regression estimator is slightly preferable.
However, in the non-linear cases, the distribution calibrated
estimator significantly increases the precision on the mean
and on the quantiles. This means that our proposed
estimator is robust when there is a non-linear relationship
between the auxiliary variable and the variable of interest.


Table 2
Model A: Estimator Variance
(Reference: Horvitz-Thompson = 1)

Parameter      Distribution calibration    Regression estimator
Mean           0.674422                    0.632608
1st decile     0.905273                    0.893876
2nd decile     0.815403                    0.802082
3rd decile     0.842681                    0.815071
4th decile     0.806749                    0.768283
5th decile     0.783731                    0.740765
6th decile     0.818051                    0.782549
7th decile     0.794411                    0.773794
8th decile     0.857114                    0.844874
9th decile     0.884424                    0.884032

Table 3
Model B: Estimator Variance
(Reference: Horvitz-Thompson = 1)

Parameter      Distribution calibration   Regression estimator
Mean           0.429689                   0.953025
1st decile     0.913598                   0.958656
2nd decile     0.919394                   1.009270
3rd decile     0.829860                   0.987950
4th decile     0.792094                   0.989114
5th decile     0.703908                   0.992023
6th decile     0.622705                   1.009830
7th decile     0.550028                   0.981249
8th decile     0.443828                   1.010340
9th decile     0.549615                   1.029120

Table 4
Model C: Estimator Variance
(Reference: Horvitz-Thompson = 1)

Parameter      Distribution calibration   Regression estimator
Mean           0.30768                    0.808114
1st decile     0.95560                    0.983582
2nd decile     0.85920                    0.970913
3rd decile     0.73854                    0.930401
4th decile     0.65728                    0.950651
5th decile     0.60500                    0.956807
6th decile     0.52139                    0.930514
7th decile     0.45709                    0.907537
8th decile     0.40752                    0.903593
9th decile     0.39820                    0.860050


9. VARIANCE AND ESTIMATION OF VARIANCE


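To compute the variance of $\hat{\bar{Y}}$, we begin by computing
the variance of $\hat{\bar{Y}}_l$. Since $\hat{\bar{Y}}_l$ is unbiased conditionally to $E_l$,
we have

$$V(\hat{\bar{Y}}_l) = E\,V(\hat{\bar{Y}}_l \mid E_l).$$

As within each of the post-strata, conditionally to $E_l$, the
design is a fixed-size simple sampling without replacement,
we have

$$V(\hat{\bar{Y}}_l \mid E_l) = \sum_{h=0}^{H+1} \frac{N_{h|l}^2}{N^2}\, \frac{N_{h|l} - n_{h|l}}{N_{h|l}\, n_{h|l}}\, S^2_{h|l}, \qquad (3)$$

where

$$N_{0|l} = r_l - 1,$$
$$N_{h|l} = r_{l+hq} - r_{l+(h-1)q} - 1, \quad h = 1, \ldots, H,$$
$$N_{H+1|l} = N - r_{l+Hq},$$
$$\bar{Y}_{0|l} = \frac{1}{N_{0|l}} \sum_{k=1}^{r_l - 1} Y_{(k)},$$
$$\bar{Y}_{h|l} = \frac{1}{N_{h|l}} \sum_{k=r_{l+(h-1)q}+1}^{r_{l+hq}-1} Y_{(k)}, \quad h = 1, \ldots, H,$$
$$\bar{Y}_{H+1|l} = \frac{1}{N_{H+1|l}} \sum_{k=r_{l+Hq}+1}^{N} Y_{(k)},$$
$$S^2_{0|l} = \frac{1}{N_{0|l} - 1} \sum_{k=1}^{r_l - 1} \left(Y_{(k)} - \bar{Y}_{0|l}\right)^2,$$
$$S^2_{h|l} = \frac{1}{N_{h|l} - 1} \sum_{k=r_{l+(h-1)q}+1}^{r_{l+hq}-1} \left(Y_{(k)} - \bar{Y}_{h|l}\right)^2, \quad h = 1, \ldots, H,$$

and

$$S^2_{H+1|l} = \frac{1}{N_{H+1|l} - 1} \sum_{k=r_{l+Hq}+1}^{N} \left(Y_{(k)} - \bar{Y}_{H+1|l}\right)^2,$$

where the $Y_{(k)}$ represent the values of $Y_k$ sorted by
increasing order of the $X_k$.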

Note that it is very difficult to calculate the unconditional
variance of $\hat{\bar{Y}}_l$, that is, the expectation of $V(\hat{\bar{Y}}_l \mid E_l)$. In
effect, the $N_{h|l}$ and the $S^2_{h|l}$ are random. However, we can
obtain an unbiased estimator of the conditional variance
(and thus of the variance) by simply estimating (3) by

$$\hat{V}(\hat{\bar{Y}}_l \mid E_l) = \sum_{h=0}^{H+1} \frac{N_{h|l}^2}{N^2}\, \frac{N_{h|l} - n_{h|l}}{N_{h|l}\, n_{h|l}}\, s^2_{h|l}, \qquad (4)$$

where

$$s^2_{0|l} = \frac{1}{n_{0|l} - 1} \sum_{j=1}^{l-1} \left(y_j - \bar{y}_{0|l}\right)^2,$$
$$s^2_{h|l} = \frac{1}{n_{h|l} - 1} \sum_{j=1}^{q-1} \left(y_{l+(h-1)q+j} - \bar{y}_{h|l}\right)^2, \quad h = 1, \ldots, H,$$

and

$$s^2_{H+1|l} = \frac{1}{n_{H+1|l} - 1} \sum_{j=l+Hq+1}^{n} \left(y_j - \bar{y}_{H+1|l}\right)^2.$$

The estimator $\hat{V}(\hat{\bar{Y}}_l \mid E_l)$ is not only unbiased for $V(\hat{\bar{Y}}_l \mid E_l)$
but also for $V(\hat{\bar{Y}}_l)$.
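As an illustration of how a post-stratified variance estimate of this kind can be computed, the sketch below sums, over post-strata, the squared population share times the usual fixed-size without-replacement variance of a mean. It assumes that the estimator is a mean, that the post-stratum population sizes are known from the ranks, and that every post-stratum holds at least two sampled units; the function name and the toy numbers are hypothetical.

```python
import numpy as np


def poststratified_variance(pop_sizes, strata_samples, n_pop):
    """Sum over post-strata of (N_h/N)^2 * (N_h - n_h)/(N_h * n_h) * s_h^2."""
    total = 0.0
    for n_h_pop, sample in zip(pop_sizes, strata_samples):
        y = np.asarray(sample, dtype=float)
        n_h = y.size
        if n_h < 2:
            # Assumption of this sketch: s_h^2 needs two or more units.
            raise ValueError("each post-stratum needs at least two sampled units")
        s2 = y.var(ddof=1)  # sample variance within the post-stratum
        total += (n_h_pop / n_pop) ** 2 * (n_h_pop - n_h) / (n_h_pop * n_h) * s2
    return total


# Toy example: two post-strata of 50 population units each.
v_hat = poststratified_variance([50, 50], [[1, 2, 3], [4, 6]], n_pop=100)
print(round(v_hat, 6))
```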


10. APPROXIMATIONS FOR COMPUTING THE 
VARIANCE 


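Unfortunately, computing the variance of $\hat{\bar{Y}}$ becomes
more complex because of covariances. In effect, we have

$$V(\hat{\bar{Y}}) = \frac{1}{q^2} \sum_{l=b}^{b+q-1} \sum_{i=b}^{b+q-1} \mathrm{Cov}(\hat{\bar{Y}}_l, \hat{\bar{Y}}_i).$$

When $l = i$, the problem is to estimate $V(\hat{\bar{Y}}_l)$, which is
done easily. When $l \neq i$, it is necessary to compute

$$\mathrm{Cov}(\hat{\bar{Y}}_l, \hat{\bar{Y}}_i) = E\,\mathrm{Cov}(\hat{\bar{Y}}_l, \hat{\bar{Y}}_i \mid E_l) + \mathrm{Cov}\left(E(\hat{\bar{Y}}_l \mid E_l),\, E(\hat{\bar{Y}}_i \mid E_l)\right).$$

Since $E(\hat{\bar{Y}}_l \mid E_l) = \bar{Y}$, we obtain

$$\mathrm{Cov}(\hat{\bar{Y}}_l, \hat{\bar{Y}}_i) = E\,\mathrm{Cov}(\hat{\bar{Y}}_l, \hat{\bar{Y}}_i \mid E_l) = E\,E(\hat{\bar{Y}}_l \hat{\bar{Y}}_i \mid E_l) - \bar{Y}^2.$$

Unfortunately, it does not appear possible to extricate the
computation of $E(\hat{\bar{Y}}_l \hat{\bar{Y}}_i \mid E_l)$ and therefore we must find an
approximation.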

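One way is to find a value that is greater than the
variance. Since

$$\mathrm{Cov}(\hat{\bar{Y}}_l, \hat{\bar{Y}}_i) \leq \sqrt{V(\hat{\bar{Y}}_l)\, V(\hat{\bar{Y}}_i)},$$

we have a greater value given by

$$V(\hat{\bar{Y}}) \leq \frac{1}{q^2} \sum_{l=b}^{b+q-1} \sum_{i=b}^{b+q-1} \sqrt{V(\hat{\bar{Y}}_l)\, V(\hat{\bar{Y}}_i)} = \left( \frac{1}{q} \sum_{l=b}^{b+q-1} \sqrt{V(\hat{\bar{Y}}_l)} \right)^2,$$

which can be estimated by

$$\hat{V}_1(\hat{\bar{Y}}) = \left( \frac{1}{q} \sum_{l=b}^{b+q-1} \sqrt{\hat{V}(\hat{\bar{Y}}_l \mid E_l)} \right)^2,$$

which comes back to estimating the standard deviation of
the means by the mean of the standard deviations.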
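The idea of estimating the standard deviation of the mean by the mean of the standard deviations can be sketched as follows. The function name and the two input values are hypothetical; the inputs stand for the conditional variance estimates obtained for each starting position of the window.

```python
import numpy as np


def upper_bound_variance(conditional_variances):
    """Square of the mean of the standard deviations (a conservative estimate)."""
    sd = np.sqrt(np.asarray(conditional_variances, dtype=float))
    return float(sd.mean() ** 2)


# Toy example with two conditional variance estimates.
v1 = upper_bound_variance([0.04, 0.09])
print(v1)
```

Because the bound treats every pair of conditional estimators as perfectly correlated, it can only overstate the variance of their average, which is consistent with the overestimation reported in the simulations.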

Lastly, a second possibility involves using a residuals
technique. Generally, when an estimator is corrected using
a calibration technique, the variance is estimated by means
of a residuals technique (see Deville and Särndal 1992 and
Deville 1999 on this topic). When computing the variance
of $\hat{\bar{Y}}_l$, it is possible to use a residuals technique to obtain
the exact variance. Consider the variable


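$$v_k(l) = \begin{cases} \sqrt{\dfrac{n (N-1)\, N_{h|l} (N_{h|l} - n_{h|l})}{N (N-n)\, n_{h|l} (N_{h|l} - 1)}}\; \left(Y_{(k)} - \bar{Y}_{h|l}\right) & \text{if } k = r_{l+(h-1)q}+1, \ldots, r_{l+hq}-1, \\ 0 & \text{if } k = r_{l+(h-1)q} \text{ or } k = r_{l+hq}, \end{cases}$$

which can appear as a residual associated with the estimator
$\hat{\bar{Y}}_l$. The variable $v_k(l)$ inserted in the traditional expression
of the variance for the fixed-size simple sampling design
without replacement is exactly equal to the conditional
variance given in (3). In effect,

$$\frac{N-n}{Nn}\, \frac{1}{N-1} \sum_{k=1}^{N} v_k^2(l) = V(\hat{\bar{Y}}_l \mid E_l).$$

This variable, however, depends on the $Y_{(k)}$, which are
unknown. We can estimate $v_k(l)$ by

$$\hat{v}_j(l) = \begin{cases} \sqrt{\dfrac{n (n-1)\, N_{h|l} (N_{h|l} - n_{h|l})}{N (N-n)\, n_{h|l} (n_{h|l} - 1)}}\; \left(y_j - \bar{y}_{h|l}\right) & \text{if } j = l+(h-1)q+1, \ldots, l+hq-1, \\ 0 & \text{if } j = l+(h-1)q \text{ or } j = l+hq. \end{cases}$$

If we insert $\hat{v}_j(l)$ in the estimator of the variance for the
simple design without replacement, we obtain an unbiased
estimator of the conditional variance, and therefore of the
variance:

$$\frac{N-n}{Nn}\, \frac{1}{n-1} \sum_{j=1}^{n} \hat{v}_j^2(l) = \hat{V}(\hat{\bar{Y}}_l \mid E_l).$$

Deville (1999) shows that the variance of a sum of
estimators can be determined by adding the residuals
associated with these estimators, the residuals having been
computed by linearization. To estimate the variance of $\hat{\bar{Y}}$,
we could therefore simply take the mean of the residuals
$\hat{v}_j(l)$, which is written

$$\hat{V}_2(\hat{\bar{Y}}) = \frac{N-n}{Nn}\, \frac{1}{n-1} \sum_{j=1}^{n} \left( \frac{1}{q} \sum_{l=b}^{b+q-1} \hat{v}_j(l) \right)^2.$$
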
These two variance estimators need to be tested by 
simulations. 


11. SIMULATIONS FOR VARIANCE 
ESTIMATORS 


The simulations shown in Table 5 are based on
populations of size N = 100 that are generated by means of
independent normal random variables. For each case
studied, we give the value of q and the coefficient of
correlation between the variable of interest $Y_k$ and the rank
$R_k$ of the auxiliary variable $X_k$. The border b is defined by
taking the integer part of q/2 + 1. Since our purpose is to
validate the variance estimator, we use 3,000 samples of
size n = 20 for each simulation and we compare the
variance obtained by the simulations of the calibrated
estimator, $V_{sim}(\hat{\bar{Y}})$, with the expectations under the
simulations of the two variance estimators, denoted
$E_{sim}(\hat{V}_a(\hat{\bar{Y}}))$, a = 1, 2. The last two columns of the table
show the relative bias, defined by

$$RB(\hat{V}_a(\hat{\bar{Y}})) = \frac{E_{sim}(\hat{V}_a(\hat{\bar{Y}})) - V_{sim}(\hat{\bar{Y}})}{V_{sim}(\hat{\bar{Y}})}.$$
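The relative bias can be checked directly against the reported figures. The sketch below uses the row of Table 5 for a correlation of 0.802 and q = 4 (simulated variance 0.045, expectations 0.065 and 0.054 of the two variance estimators); the function name is hypothetical.

```python
def relative_bias(expected_estimate, simulated_variance):
    """(expectation of the variance estimator - simulated variance) / simulated variance."""
    return (expected_estimate - simulated_variance) / simulated_variance


rb1 = relative_bias(0.065, 0.045)
rb2 = relative_bias(0.054, 0.045)
# Matches the 0.444 and 0.200 reported in the last two columns of that row.
print(round(rb1, 3), round(rb2, 3))
```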


The simulations show that the two proposed estimators
overestimate the variance. The overestimation appears to
diminish as q increases. The estimator $\hat{V}_2(\hat{\bar{Y}})$ definitely has
the smallest bias. We will therefore prefer to use $\hat{V}_2(\hat{\bar{Y}})$.

Table 5
Simulation Results

Correlation: 0.802
q    V_sim    E_sim(V1)   E_sim(V2)   RB(V1)   RB(V2)
4    0.045    0.065       0.054       0.444    0.200
5    0.045    0.066       0.057       0.467    0.267
6    0.056    0.076       0.070       0.357    0.250
7    0.058    0.079       0.059       0.362    0.017
8    0.063    0.088       0.087       0.397    0.381

Correlation: 0.481
q    V_sim    E_sim(V1)   E_sim(V2)   RB(V1)   RB(V2)
4    0.048    0.066       0.059       0.375    0.229
5    0.045    0.060       0.054       0.333    0.200
6    0.044    0.056       0.051       0.273    0.159
7    0.044    0.054       0.051       0.227    0.159
8    0.045    0.052       0.048       0.156    0.067

Correlation: 0.111
q    V_sim    E_sim(V1)   E_sim(V2)   RB(V1)   RB(V2)
4    0.281    0.471       0.363       0.676    0.292
5    0.297    0.420       0.356       0.414    0.199
6    0.279    0.363       0.316       0.301    0.133
7    0.267    0.342       0.324       0.281    0.213
8    0.282    0.327       0.281       0.160   -0.004


12. CONCLUSIONS 


Our proposed estimator is one of the rare estimators that
is both unbiased and linear, that uses auxiliary information
and that is calibrated on the size of the population. It can be
parameterized using the width of the window q. This new
estimator is robust compared with the regression estimator.
It can take into account auxiliary information with a
non-linear relationship with the variable of interest. Most
simulations appear to show that the width of the window
does not significantly affect the precision of the mean
estimator. It also appears that a small window width is not
penalizing, even if there is no dependence between the
auxiliary variable and the variable of interest. However,
the smaller q is, the tighter the calibration, and the variance
estimator will then be significantly penalized because a
degree of freedom is lost in each post-stratum. The choice
of q must therefore reflect this problem.

There are many other methods that allow for the use of
the information given by a distribution function (see Ren
2000) to improve an estimator. The results that we have
presented are limited to simple sampling designs, but we
think they are important just as post-stratification is
important as a specific case of calibration techniques.
Post-stratification is one of the few examples where it is
possible to show with accuracy that calibration corresponds
to a conditional approach. Further, our approach can be
seen as a calibration on a distribution function providing an
unbiased estimator. A good general distribution calibration
technique should therefore include in simple sampling
designs the method we have presented.


Variance Estimation for the Current Employment Survey


JUN SHAO and SHAIL BUTANI


ABSTRACT 


As in most other surveys, nonresponse often occurs in the Current Employment Survey conducted monthly by the U.S.
Bureau of Labor Statistics (BLS). In a given month, imputation using reported data from previous months generally provides 
more efficient survey estimators than ignoring nonrespondents and adjusting survey weights. However, imputation also has 
an effect on variance estimation: treating imputed values as reported data and applying a standard variance estimation 
method leads to negatively biased variance estimators. In this article we propose some variance estimators using the grouped 
balanced half sample method and re-imputation to take imputation into account. Some simulation results for the finite 
sample performance of the imputed survey estimators and their variance estimators are presented. 


KEY WORDS: Balanced half samples; Non-negligible sampling fractions; Ratio imputation; Stratified sampling. 


1. INTRODUCTION 


The Current Employment Survey (CES), commonly
known as the payroll survey, is conducted monthly by the
U.S. Bureau of Labor Statistics (BLS). The data are
obtained from establishments on a monthly basis by various
automated methods including computer assisted telephone
interviews, touchtone data entry, FAX, electronic data
interchange, mail, etc. The main variables are employment,
production or non-supervisory workers, and their
working hours and earnings on nonagricultural
establishment payrolls. Population employment counts are
obtained once a year from Unemployment Insurance
administrative records.

Nonresponse often occurs in the CES. In any particular 
month, imputation using reported data from previous 
months generally provides more efficient survey estimators 
than using reported data in the current month only and 
adjusting survey weights. This is particularly true in the 
CES because the nonresponse rate is about 60-80% and 
about 60% of the nonrespondents in a given month may 
become available one or several months later so that these 
data can be used as “reported data from previous months” 
(historical data) in a future month. 

However, it is well known that treating imputed values
as reported data and applying a standard variance
estimation method leads to biased (often negatively biased)
variance estimators. Valid variance estimators can be
derived under some assumptions on sampling designs,
imputation methods, and response mechanisms (and,
sometimes, models that generate data).

The purpose of this article is to study variance
estimation for the CES. After describing the sampling
design and the imputation procedure currently used for the
CES in section 2, we derive valid (asymptotically unbiased
and consistent) variance estimators for imputed survey
estimators in section 3. To simplify the computation of
variance estimators, we propose some approximations in
section 4 and study their properties by simulation in section
5. Some conclusions are drawn in section 6. Although our
study is motivated by the CES, we believe that our results
are general and applicable to any survey that adopts a
similar sampling design and a similar imputation method.


2. SAMPLING DESIGN AND IMPUTATION 


The CES adopts the following stratified probability
sampling design. Let P be a finite population containing a
set of establishments {1, ..., N}, which is stratified by the
type of industry and by the size of the establishment. Within
the hth stratum, a sample of size $n_h \geq 2$ is taken without
replacement from $N_h$ population units, using probability
sampling independently across strata. The sampling
fractions $n_h/N_h$ are not necessarily negligible; for some
strata with large establishment sizes, $n_h = N_h$. Let S denote
the sample. For $i \in S$, at month t = 0, 1, ..., 7, values of the
number of employees ($y^E_{t,i}$), the number of non-supervisory
workers ($y^W_{t,i}$), the number of hours worked ($y^H_{t,i}$), and the
weekly pay ($y^P_{t,i}$) are observed (if there is no nonresponse).
Let $y_{t,i}$ denote any of $y^E_{t,i}$, $y^W_{t,i}$, $y^H_{t,i}$ or $y^P_{t,i}$. In the CES, the
main parameters of interest are population totals
$Y_t = \sum_{i \in P} y_{t,i}$, t = 1, ..., 7. Since population totals can be
obtained once a year from administrative records, we
assume without loss of generality that $Y_0$ is known. If there
is no nonresponse, $Y_t$ is estimated by a ratio estimator


Fk) Fw / LD * oi ae Reread Be (1) 


ieS 


where $w_i$ is the survey weight for the ith unit in the sample
and the hth stratum.

In our research, starting from month 1, nonrespondents
are imputed using the imputation method proposed in
Butani, Harter and Wolter (1997), as described below.
Imputation is carried out within an imputation cell, which
is the same as a stratum or a union of strata. Imputed values
in months 1, ..., t - 1 are carried over to impute
nonrespondents in month t, unless nonrespondents in months
1, ..., t - 1 become respondents prior to month t.


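1. The number of employees. If $y^E_{t,i}$ is a nonrespondent,
it is imputed by

$$\tilde{y}^E_{t,i} = \tilde{y}^E_{t-1,i}\; \frac{\sum_{j \in R_t} w_j\, y^E_{t,j}}{\sum_{j \in R_t} w_j\, y^E_{t-1,j}},$$

where $\tilde{y}^E_{t-1,i} = y^E_{t-1,i}$ (reported value) if $y^E_{t-1,i}$ is available
at month t and otherwise $\tilde{y}^E_{t-1,i}$ is an imputed value,
and $R_t$ is the set of all reporting units for months t and
t - 1.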


2. The number of non-supervisory workers. If $y^W_{t,i}$ is a
nonrespondent, it is imputed by


Wie Ws ee E,~E 
Vas VI Via 


where $\tilde{y}^W_{t-1,i}$ is defined similarly to $\tilde{y}^E_{t-1,i}$.


3. The number of hours worked. If $y^H_{t,i}$ is a
nonrespondent, it is imputed by


GU eG aie te Wiad 
Vni — Ve Vet Ye Veni 
~H . ,. ate ~E 
where y,_, ; is defined similarly to y,", ; and 


drat Xe We Yay 
E w/e wey tien 


4. The weekly pay. If $y^P_{t,i}$ is a nonrespondent, it is imputed
by


t 


ae pay ye 


AEP) Bi. ; Hae NE) 
where y,_, ; is defined similarly to y,", , and 
P H 
De Wee OL 
. JER, JER, 
ce oe | eae 
De aes We vreiy 
JER, JER, 


Once nonrespondents are imputed, estimated monthly 
totals are calculated according to (1) by treating imputed 
values as reported data. 
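The ratio imputation step for the number of employees can be sketched as below. This is a minimal illustration, assuming a single imputation cell, previous-month values that are all available (reported or already imputed), and a weighted ratio computed over the units reporting in both months; the function and variable names are hypothetical.

```python
import numpy as np


def ratio_impute(y_prev, y_curr, weights, responded):
    """Fill current-month nonrespondents with ratio * previous-month value."""
    y_prev = np.asarray(y_prev, dtype=float)
    y_curr = np.asarray(y_curr, dtype=float)
    w = np.asarray(weights, dtype=float)
    r = np.asarray(responded, dtype=bool)
    # Weighted month-to-month ratio based on the reporting units only.
    ratio = np.sum(w[r] * y_curr[r]) / np.sum(w[r] * y_prev[r])
    completed = np.where(r, y_curr, ratio * y_prev)
    return completed, ratio


y0 = [10.0, 20.0, 30.0]        # previous month, all available
y1 = [12.0, 24.0, np.nan]      # current month, third unit missing
completed, ratio = ratio_impute(y0, y1, weights=[2.0, 2.0, 2.0],
                                responded=[True, True, False])
print(ratio, completed)
```

The completed vector can then be passed to the estimator of the monthly total as if every value had been reported, which is the treatment that the variance estimators discussed later have to correct for.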


Assume that the population P is divided into K disjoint 
imputation cells |e me a x and for each k, 


V11,7 7? 


E(¢,)=0, “ieP, tee 


iis = Ok Veal i i 
EO, i) = Hy ee 


VinOni) = Ver (2, ;) = Op (2) 


where y, ; denotes any of Nea Veen , OF Yep E,, and V, 
are the model (marginal) expectation and variance, respec- 
tively, a, and o, are unknown parameters, e, ,’s are iid 
and the two processes {y,,} and te, J ate independent. 
Within each $P_k$, it is assumed that the response indicator
$a_{t,i}$ (= 1 if $y_{t,i}$ is a respondent and = 0 otherwise) and $y_{t,i}$
are independent, given $y_{t-s,i}$, $a_{t-s,i}$, s = 1, 2, ..., t. Under this
response mechanism, which is an unconfounded
response mechanism (Lee, Rancourt and Särndal 1994), $a_{t,i}$
and $y_{t,i}$ are dependent, but only through $y_{t-s,i}$, $a_{t-s,i}$,
s = 1, 2, ..., t. It is more general than the assumption that
$(y_{0,i}, \ldots, y_{t,i})$ and $(a_{1,i}, \ldots, a_{t,i})$ are independent. Finally,
response indicators from different units are assumed to be
independent. Under these assumptions, the estimators $\hat{Y}_t$
based on imputed data as described in the previous section
are asymptotically unbiased with respect to the joint
expectation under model (2) and sampling from the finite
population.
In the CES, the imputation cells are unions of strata so
that

$$\sum_{i \in S \cap P_k} w_i = M_k,$$

where $M_k$ is the number of population units in the kth
imputation cell $P_k$. Consequently, the $\hat{Y}_t$ are conditionally
unbiased with respect to the model expectation (given S),
i.e.,

$$E_m(\hat{Y}_t - Y_t) = 0.$$


3. VARIANCE ESTIMATION 


Let $E_s$ and $V_s$ be the sampling expectation and variance,
respectively, and V be the overall variance. Then

$$V(\hat{Y}_t - Y_t) = E_s[V_m(\hat{Y}_t - Y_t)] + V_s[E_m(\hat{Y}_t - Y_t)] = E_s[V_m(\hat{Y}_t - Y_t)], \qquad (3)$$

since $E_m(\hat{Y}_t - Y_t) = 0$. Furthermore, it is shown in the
Appendix that

$$V_m(\hat{Y}_t - Y_t) = V_m(\hat{Y}_t) - V_m(Y_t). \qquad (4)$$

Note that (4) is obvious in the case of no nonresponse.

Because of (3), the estimation of $V(\hat{Y}_t - Y_t)$ is the same
as the estimation of $V_m(\hat{Y}_t - Y_t)$. Also, because of (4), we
can first derive estimators $v_{t1}$ and $v_{t2}$ of $V_m(\hat{Y}_t)$ and
$V_m(Y_t)$, respectively, and then take the difference $v_{t1} - v_{t2}$
as our variance estimator for $\hat{Y}_t$. Since $V_m(Y_t)$ is a
conditional variance, given S, we do not need to consider the
sampling fractions $n_h/N_h$ in the estimation of $V_m(Y_t)$.

We first consider the estimation of $V_m(\hat{Y}_t)$. If an
approximate formula of $V_m(\hat{Y}_t)$ can be derived, then we
can directly estimate $V_m(\hat{Y}_t)$ by substitution. The explicit
form of $\hat{Y}_t$, however, is very complex when t is not small, so
that the derivation of $V_m(\hat{Y}_t)$ is very difficult. Thus, in the
CES we adopt a grouped half sample method that
incorporates Rao and Shao's (1992) adjustment (or
re-imputation) to take imputation into account. Specifically,
sampled units in each stratum are randomly grouped into
two groups. R half samples are created using a Hadamard
matrix, where $H + 1 \leq R \leq H + 4$ and H is the number of
strata. For the rth half sample and the ith sampled unit,
define


$$w_i^{(r)} = \begin{cases} (1 + 0.5)\, w_i & \text{if the unit is in the rth half sample} \\ (1 - 0.5)\, w_i & \text{if the unit is not in the rth half sample,} \end{cases}$$


where $w_i$ is the original survey weight. Let $\hat{Y}_t^{(r)}$ be the same
as $\hat{Y}_t$ except that the weights $w_i$ are replaced by the $w_i^{(r)}$,
including the weights used in imputation (i.e., the ratios
used in imputation are re-computed for every r, which is
equivalent to Rao and Shao's adjustment). A grouped half
sample variance estimator of $V_m(\hat{Y}_t)$ is


(5) 


Note that the use of 0.5, instead of 1, in the construction
of $w_i^{(r)}$ is based on Fay's method (Dippo, Fay and
Morganstein 1984; Judkins 1990; Rao and Shao 1999).
Asymptotically, $v_{t1}$ is unbiased and consistent for $V_m(\hat{Y}_t)$
(Shao, Chen, and Chen 1998; Rao and Shao 1999; Shao
and Chen 1999).
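A grouped half sample computation with Fay's factor can be sketched as follows. Several points are assumptions of this illustration rather than statements about the CES implementation: the Hadamard matrix is built by the Sylvester construction (so its order is a power of two rather than between H + 1 and H + 4), the variance formula is the standard Fay form (squared deviations of the replicates from the full-sample estimate, divided by R times 0.5 squared), strata are coded 0, ..., H - 1, and re-imputation is left to the user-supplied `estimator` function. All names are hypothetical.

```python
import numpy as np


def sylvester_hadamard(min_order):
    """Return a +/-1 Hadamard matrix whose order is a power of two >= min_order."""
    h = np.array([[1]])
    while h.shape[0] < min_order:
        h = np.kron(h, np.array([[1, 1], [1, -1]]))
    return h


def fay_half_sample_variance(y, w, stratum, group, estimator, eps=0.5):
    """Grouped half sample variance with weights (1 +/- eps) * w."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    stratum = np.asarray(stratum)      # integer codes 0..H-1
    group = np.asarray(group)          # 0 or 1: random group within stratum
    n_strata = int(stratum.max()) + 1
    had = sylvester_hadamard(n_strata + 1)   # R > H replicates
    theta = estimator(y, w)
    reps = []
    for row in had:
        # Column h + 1 decides which group of stratum h forms the half sample.
        sign = row[stratum + 1]
        in_half = (group == 0) == (sign == 1)
        w_r = np.where(in_half, (1 + eps) * w, (1 - eps) * w)
        # Any imputation should be redone inside `estimator` with w_r.
        reps.append(estimator(y, w_r))
    reps = np.asarray(reps)
    return float(np.sum((reps - theta) ** 2) / (had.shape[0] * eps ** 2))


def weighted_total(y, w):
    return float(np.sum(w * y))


y = [1, 2, 3, 4, 5, 6, 7, 8]
w = [1.0] * 8
stratum = [0, 0, 0, 0, 1, 1, 1, 1]
group = [0, 0, 1, 1, 0, 0, 1, 1]
v = fay_half_sample_variance(y, w, stratum, group, weighted_total)
print(v)
```

For a linear estimator such as the weighted total, the orthogonality of the Hadamard columns makes the result equal to the sum over strata of the squared difference between the two group totals, which gives a simple check of the construction.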

We now consider the estimation of $V_m(Y_t)$. Under
model (2),

$$V_m(Y_t) = \sum_k M_k\, \sigma^2_{t,k},$$

which is of the order O(N), where N is the size of the
population P. Usually $V_m(\hat{Y}_t)$ is of the order $O(N^2/n)$,
where $n = \sum_h n_h$ is the sample size. Hence $V_m(Y_t)/V_m(\hat{Y}_t)$
is of the order O(n/N) and the estimation of $V_m(Y_t)$ is not
necessary if n/N is negligible (although some sampling
fractions $n_h/N_h$ are not negligible).

In the CES, however, n/N is around 15% and is not
negligible. Hence, the estimation of $V_m(Y_t)$ is necessary.
An asymptotically unbiased and consistent estimator of
$V_m(Y_t)$ is

$$v_{t2} = \sum_k M_k\, s^2_{t,k}, \qquad (6)$$

where $s^2_{t,k}$ is the usual sample variance based on the
respondents $y_{t,i}$ in the kth imputation cell.


4. APPROXIMATE VARIANCE ESTIMATORS 


From section 3, a correct variance estimator for $\hat{Y}_t$ is
$v_{t1} - v_{t2}$, where $v_{t1}$ and $v_{t2}$ are given by (5) and (6),
respectively. Although $v_{t1}$ can be easily extended to the
case where $\hat{Y}_t$ is replaced by some nonlinear estimator such
as $\hat{Y}^P_t / \hat{Y}^H_t$ (the ratio of weekly pay over hours), the
extension of $v_{t2}$ involves the derivation of a Taylor
expansion for each separate nonlinear estimator. Thus, for
the CES, it is desirable to derive an approximate variance
estimator that is not exactly correct but does not require the
computation of $v_{t2}$.

Note that if n/N is negligible, then we can simply use
$v_{t1}$ as an estimator of $V(\hat{Y}_t - Y_t)$. In the CES, however,
using $v_{t1}$ leads to overestimation, since n/N is not
negligible (see also the simulation results in section 5).
Since this overestimation is caused by the sampling
fraction, a possible way to fix the problem is to incorporate
sampling fractions in the half sample method. When there
is no nonresponse, sampling fractions can be incorporated
into the half sample method by using formula (5) with $w_i^{(r)}$
replaced by

$$\tilde{w}_i^{(r)} = \begin{cases} \left(1 + 0.5 \sqrt{1 - n_h/N_h}\right) w_i & \text{if the unit is in the rth half sample} \\ \left(1 - 0.5 \sqrt{1 - n_h/N_h}\right) w_i & \text{if the unit is not in the rth half sample,} \end{cases} \qquad (7)$$

when i is in stratum h.

Let $\tilde{v}_{t1}$ be the variance estimator obtained using (5) but
with $w_i^{(r)}$ replaced by $\tilde{w}_i^{(r)}$. If we use $\tilde{v}_{t1}$ as an estimator
of $V(\hat{Y}_t - Y_t)$, however, it has a negative bias, although it is
better than the naive estimator that treats imputed values as
observed data (see the simulation results in section 5).

While $v_{t1}$ overestimates and $\tilde{v}_{t1}$ underestimates the true
variance $V(\hat{Y}_t - Y_t)$, a compromise is to replace the
sampling fraction $n_h/N_h$ in (7) by the "estimated sampling
fraction" $r_{t,h}/N_h$, where $r_{t,h}$ is the number of respondents
in stratum h at month t. Let $\hat{v}_{t1}$ be the variance estimator
obtained using (5) and (7) but with $n_h/N_h$ in (7) replaced
by $r_{t,h}/N_h$. Then

All three variance estimators are asymptotically unbiased
and are approximately equal when n/N is negligible. When
n/N is not negligible, however, they are asymptotically
biased.
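The three sets of half sample weights differ only in the factor applied to the original weight, which the sketch below makes explicit. It assumes the square-root form of the adjustment in (7); the function name and the stratum sizes in the example are hypothetical.

```python
import math


def half_sample_factors(n_h, N_h, r_h=None, use_fraction=True, eps=0.5):
    """Multipliers for units in / not in the half sample of a stratum.

    use_fraction=False      -> plain factors 1 +/- eps
    use_fraction=True       -> shrink by sqrt(1 - n_h/N_h), as in (7)
    r_h given (respondents) -> shrink by sqrt(1 - r_h/N_h) instead
    """
    if not use_fraction:
        shrink = 1.0
    else:
        fraction = (n_h if r_h is None else r_h) / N_h
        shrink = math.sqrt(1.0 - fraction)
    return 1.0 + eps * shrink, 1.0 - eps * shrink


# Hypothetical stratum: N_h = 100 units, n_h = 36 sampled, r_h = 19 respondents.
plain = half_sample_factors(36, 100, use_fraction=False)
with_n = half_sample_factors(36, 100)
with_r = half_sample_factors(36, 100, r_h=19)
print(plain, with_n, with_r)
```

Because the number of respondents cannot exceed the sample size, the perturbation based on the estimated sampling fraction lies between the other two, which is the sense in which it is a compromise.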

To see the magnitude of the biases of $v_{t1}$, $\tilde{v}_{t1}$, and $\hat{v}_{t1}$,
we consider the simplest case of no strata and t = 1. Let
$y_i = y^E_{1,i}$, $x_i = y^E_{0,i}$ and

$$\hat{Y} = \sum \left[ a_i y_i + (1 - a_i) R x_i \right],$$

where $a_i = 1$ if $y_i$ is a respondent and $a_i = 0$ otherwise,
$R = \sum a_i y_i / \sum a_i x_i$, and all summations are over $i \in S$. Let
$U = (\sum x_i / n) / (\sum a_i x_i / r)$, where r is the number of
y-respondents. Then the correct variance estimator for $\hat{Y}$
is $v_1 - v_2$ with
is vv, with 


ner N?0°s; QR ee ee 
wih vais se yt peeaPt ae tees ett 
r n 
and 
v, = NUs; + 2NURs,+NR’s? 
where 55 = (r-1)! 4, Rapes a= (7 a1) 


AG TS -Rx. ;), and s, is the sample Cte based on 
XFS. Also, 


el N?2U0°s? 2N?URs,+N?R s- 
V,= ee ——— | a ee 
: N r n 


nNU’s. rs eee 
lee -2NURs,, SIVA SS, 
r 
and 
, ite INE tsa 2NZOR SN? Ros, 
Va = ppaecs i a et 
| N r n 
2 2 2rNURs,.+rNR’s, 
aa -NU Sy 7; 
n 
Since v, —v, is asymptotically unbiased, the bias of VEY, 


is of the wii order as v, and is always non- ene ative: the 
bias of V,,= v, is of the tins order as 


Ra eye is =a;) x; 
aE SF ED t bedtinex 


and is always non-positive; and the bias of ¥., =¥ is of the 
same order as 


NU O39 +{1-2] bv Rs,, + NR? s?) (8) 
n 


The bias in (8) is non-negative if $s_{dx} \geq 0$ and $U \approx 1$ (which
is true if $a_i$ is independent of $x_i$).


5. SOME SIMULATION RESULTS 


To further study the biases of the variance estimators
$v_{t1}$, $\tilde{v}_{t1}$, and $\hat{v}_{t1}$, we conducted a simulation study using a
CES dataset (from the 1980s) of 149,044 units as the
population P. Each unit $i \in P$ has a vector
$y_i = (y^E_{t,i}, y^W_{t,i}, y^H_{t,i}, y^P_{t,i},\ t = 0, 1, \ldots, 7)$ and a vector $r_i$ consisting
of response indicators of the components of $y_i$, although all
values of $y_i$ are available (from administrative records).
The sample S in the simulation was obtained by generating
a stratified simple random sample $\{y_i\}$ of size 23,092 from
P according to the sample allocations listed in Table 1. The
response indicators of $\{y_i\}$ in the simulation were
generated by drawing another (independent) stratified
simple random sample $\{r_i\}$ from P. Thus, nonrespondents
in the simulation were random and distributed according to
the values of the $r_i$'s in the dataset P, but independent of
the $y_i$'s.

After the sample data and nonrespondents were
generated, nonrespondents were imputed as described in
section 2. Estimated monthly totals $\hat{Y}_t$ and monthly changes
$\hat{Y}_t - \hat{Y}_{t-1}$ were calculated based on imputed data and their
variance estimators, $v_{t1}$, $\tilde{v}_{t1}$, $\hat{v}_{t1}$ and $v_{t1} - v_{t2}$, were
computed as described in sections 3 and 4. For comparison,
the naive variance estimator $v_{t0}$, computed by treating
imputed values as observed data, was also computed.

Based on 1,000 simulations, the relative biases (RB) and
variances (Var) of the estimated totals $\hat{Y}_t$ and changes
$\hat{Y}_t - \hat{Y}_{t-1}$, the RB and coefficient of variation (CV) of the
variance estimators for $\hat{Y}_t$ and $\hat{Y}_t - \hat{Y}_{t-1}$, the coverage
probability (CP) of the approximate 95% confidence intervals of
the form

$$\text{the estimate} \pm 1.96 \sqrt{\text{the estimated variance}},$$

and the width (MW) of the confidence interval are given in
Tables 2 through 5 respectively for 4 different variables.
Estimated simulation standard errors are 2% for RB, CV
and MW, and 0.5% for CP.

Table 1
Sample Size by Stratum

SIC            SIZE  Stratum  Sample  Sampling    SIC            SIZE  Stratum  Sample  Sampling
                     Size     Size    Fraction                         Size     Size    Fraction
10, 12-14      1       567      14    0.02439     50-51          1      3631      66    0.01812
               2       433     303    0.70000                    2      3678     183    0.04987
               3       526     526    1.00000                    3      4300     403    0.09375
               4       210     210    1.00000                    4      1831     289    0.15789
               5       165     165    1.00000                    5       833     320    0.38461
15-17          1      5055     129    0.02549     52-59          1      7084     149    0.02103
               2      4476     570    0.12731                    2      5701     440    0.07724
               3      5281    1154    0.21854                    3      8363    1037    0.12403
               4      2112     836    0.39583                    4      4511     763    0.16915
               5      1005    1005    1.00000                    5      4087    1002    0.24528
24-25, 32-39   1      3103      73    0.02349     60-62, 67      1      1384      17    0.01230
               2      3905     331    0.08475                    2       971      38    0.03906
               3      6381     891    0.13966                    3      1529     115    0.07500
               4      4273    1036    0.24242                    4       981      67    0.06818
               5      4143    2127    0.51351                    5       728      73    0.10000
20-23, 26-31   1      1754      40    0.02276     63-64          1      1364      15    0.01119
               2      1953     128    0.06564                    2       652      20    0.03125
               3      3590     524    0.14599                    3       754      87    0.11538
               4      3108     596    0.19167                    4       435      48    0.11110
               5      3448    1041    0.30189                    5       344      57    0.16667
40-49          1      1648      31    0.01902     07, 70-99      1      9641     230    0.02385
               2      1463     101    0.06918                    2      6701     643    0.09602
               3      1988     221    0.11111                    3      7833    1275    0.16275
               4      1171     211    0.18033                    4      4839    1317    0.27215
               5       759     108    0.14286                    5      4352    2067    0.47500

Table 2 
Simulation Results for Employment 
Estimation Variance estimation for estimated total 

olgtal V0 aa Vn Vi Var V2 


Month Total* RB Var* RB CV CP MW RB CV CP MW RB CV CP MW RB CV CP MW RB CV_ CP MW 
1 6.7E6. «0.0 ©5:5E7 =37.0.47.6185:3" 7.7 § -4.1'°67.5 92.3 -9.2, «4.9. 69:8°93.19 9.6: 19:5 76.1 95.1, 10:3; 7:4 67:4°92.8 97 


2 6.8E6 0.0 8.8E7 -34.3 28.8 86.9 9.6 -7.3 40.4 92.6 11.4 0.9 42.9 93.6 12.4 15.3 47.6 94.7 12.7 44 49.1 92.3 12.1 
3 6.9E6 0.0 1.4E8 -26.1 30.4 88.2 12.9 -4.1 42.3 91.8 14.7 1.4 44.2 92.9 15.1 18.8 49.9 94.8 16.3 3.6 50.5 90.8 15.2 
4 6.9569 0.0 2:1 BS 0-22. 5.0329 89:3016.4 922.4544.0..92.1- 18.19 3.8 46.33 92.7.418.7 > 22:3: 53.1 94.7; 20,3, 92:7. 5132914 18.6 
> 6.9E6 0.0 2.7E8 -21.9 35.0 88.3 18.4 -7.7 45.2 90.9 20.0 -1.1 47.9 92.0 20.7 16.2 55.6 94.4 22.4 -4.7 54.2 90.9 20.3 
6 G9E6 ©0.0 +2.0E8 po -8.8240:5791.917.1 25.2 841.7 391.9: 17.40 $0.0 43.6: 93.1517.9) 19.7.451.8; 95.5, 19:6 3:1 52:5°905 17.6 
7 6.9E6 0.0 1.5E8 -12.4 34.8 91.8 14.5 -8.6 36.1 92.5 14.8 -2.0 38.3 93.6 15.3 16.8 45.0 96.2 16.7 -6.6 42.4 92.7 15.0 

Estimation Variance estimation for estimated change 

pi change %0 Yn Yn as Yael, 


Month Change* RB* Var RB CV CP MW RB CV CP MW RB CV CP MW RB CV CP MW RB CV CP MW 
S04 60.1 BGAET 0-43,0125.4484:00 17.5 £013 241.4092 3 2 9:35.0et'S 43-9193.79 19:74, 69.4 487 95.6 10:3" P86" 51.7 293.5) 10.3 
OFBA S21 8) T4E7 /-35.0 -31,785.08 (8.7. 285 <46.0190.5 10.4. 53:2 47-7) 910° 10:7" 14.7 53.1 93.4 10-5 53.1 “48.8 .90:9° 10.7 
1:8E4 + 2.9 1,1E8 »)-31,8 42:3)'87.4 11.0 920.9 60.6 93.1. 13.2; 4:9 63.2 93.6:13.6 25.0 73.5 95.9 14.8 -2.5 47.7°89.9 13.1 
4.4E4 3.4 1.1E8 -41.9 34.5 83.1 10.1 -10.8 57.3 91.4 12.5 -4.9 60.4 92.3 12.9 13.2 69.4 94.6 14.1 0.8 94.1 93.1 13.3 

-1:1E4 9.3. 1.1E8  -41.0°29.9 84:1 10.2 -12.6 42.0 91.1 12.4 -6.4 44.2 93.0128 9.4 50.2 94.6 13.9 -4.1 53.9 93.0 13.0 

7 1.683..3.201.288--43.8 38.4.82.9 10.4 -15.9 575 89.6 12.7 -11.3.60.190.5 13.1) 25.6.69.9..92.6.14.2).5-0:2075.5, 9010 113.8 

Total: population total. 

Change: population difference between the current month and the previous month. 

Var: variance of the estimated total or change. 

RB: relative bias = 100(bias/true value)%. 

CV: coefficient of variation = 100 (standard error/true value)%. 

CP: coverage probability of asymptotic confidence interval using estimated variance (in %). 

MW: (mean width of asymptotic confidence interval)/10’. 

*: Scientific notation (for example, 6,700,000 is 6.7E6). 
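The summary measures defined in the notes above can be computed from simulation output along the following lines. This is a sketch under stated assumptions: the "true value" of the variance is taken to be the simulation variance of the point estimates, the width is the full width of the interval, and no power-of-ten scaling is applied to MW. The function name and the toy inputs are hypothetical.

```python
import numpy as np


def summarize(estimates, variance_estimates, true_value, z=1.96):
    """RB, CV, CP (all in %) and mean width for one variance estimator."""
    est = np.asarray(estimates, dtype=float)
    v = np.asarray(variance_estimates, dtype=float)
    true_var = est.var(ddof=1)             # simulation variance of the estimator
    half = z * np.sqrt(v)                  # half width of each interval
    covered = (est - half <= true_value) & (true_value <= est + half)
    return {
        "RB": 100 * (v.mean() - true_var) / true_var,
        "CV": 100 * v.std(ddof=1) / true_var,
        "CP": 100 * covered.mean(),
        "MW": float((2 * half).mean()),
    }


# Toy example: four simulated estimates of a total whose true value is 10.
result = summarize([9.0, 11.0, 10.0, 10.0], [1.0, 1.0, 1.0, 1.0], true_value=10.0)
print(result)
```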


Table 3
Simulation Results for Non-supervisory Workers 
Estimation Variance estimation 
of Total oi e Ss 
Yio Vy, Vi Yn Laer de: 


t 

Month Total* RB  Var* RB CV CP MW RB CV CP MW RB cv. CP MW RB CV CP MW RB CV CP MW 

] 5.4E6 -0.1 4.6E7 -33.3 49.7 80.9 7.0 -4.4 66.1 88.1 84 46 68.6 89.9 88 19.8 75.5 92.3 94 3.4 653 899 87 
5.5E6 -0.1 7.6E7 -30.6 31.4 84.0 9.2 -7.4 41.1 89.4 106 0.9 43.7 91-0"11.1 15.8°48.7 93.8 1F9 94.2 50:8 68 a0n1 
5.6E6 -0.1 1.2E8 -23.6 31.2 85.6 12.1 -4.8 41.0 89.5 13.5 07 42.9 90.0 13.9 18.4 48.7 93.1 15.1 3.9 50.9 90.8 14.1 
5.6E6 -0.1 1.9E8 -19.0 34.5 88.4 15.7 -2.4 43.8 91.7 172 38 46:3 191,997.89 225953.2°94.1 193 1:8 71-7090. Seta 
5.7E6 -0.1 2.4E8 -18.9 36.8 87.8 17.6 -7.1 45.3 89.7 189 -0.4 48.2 90.7°19.6 17.2°56.0 93.0 212 -4.1 547 904 193 
5.7E6 0.0 1.8E8 -7.6 41.7 91.8 16.3 -4.7 42.8 92.4 166 06 A4.8°92.7-17.0 20.6°531 95.4 1816 -3.3 53.105: 167 
5.7E6 0.0 1.4E8 -10.9 36.1 91.9 14.1 -7.7 37.2 92.2 14.4 1.0 39.4 93.6 15.0 18.3 46.3 95.9 16.3 -8.5 42.5 926 143 


Estimation Variance estimation 


peas V0 Vi Yi Vi Vir Veo 
Month Change* RB Var* RB CV CP MW RB CV CP MW RB C CP MW RB CV CP MW RB CV CP MW 
De 7.7E4 -0.8 5.1E7 -40.8 27.0 84.5 10 =12:9) 4152) 915" “8:4 eGlon4a7 200 4 8.8 8.2 48.8 944 94 99 548 930 9.5 
3 


SIE4 -1.4 16.287 - -31.26304) 86.4 83 .-8.7 42.6 91.2 9°5 =3-26A4.5°\919e.98 1236499 °94.1°106 -3:1 47.7 91.40.08 


NNN AW Lb 


oi 1.6E4 19.6 -9.1E7 -27.2.44:0 87.1 10.3 -1.1 59.4 928 1204s 62:1 94 1942:3 “2495930 95.8 145 9-55-91 8 91.4 11.7 
5 4.4E4 -0.4 9.5E7 -37.5 38.4 83.4 9.7 -10.0 586 908 11.7 §-3:9761.8 “O1S 121" 14 5x7 Os 4a o> Vanes P28) OW 
6 sIOES -19.359.0E7 237 0—324 S34 91502111 42 1.806 1134.7 545.5 904 119) T1758 92.4 12.7 -3.3 54.7 90.9 11.8 
7 7.9E2 48.7 1.0E8 -39.3 42.6 83.7 9.9 -14.5 59.7 892 11.7 

Total: population total. 

Change: population difference between the current month and the previous month. 

Var: variance of the estimated total or change. 

RB: relative bias = 100(bias/true value)%. 

CV: coefficient of variation = 100 (standard error/true value)%. 

CP: coverage probability of asymptotic confidence interval using estimated variance (in %). 

MW: (mean width of asymptotic confidence interval) /10’. 

*: Scientific notation (for example, 6,700,000 is 6.7E6). 


9.8 62.4 90.2 120 7.6 72.6 92.6 13.1 -1.3 76.8 90.6 126 


Table 4 
Simulation Results for Hours 
Estimation Variance estimation 
of Total v v, v v ve, Hae 
10 t tl tl tl 12 


Month Total* RB Var* RB CV CP MW RB CV CP MW RB CV _ CP MW RB CV CP MW RB CV cp MW 

1 1.9E8 -0.1 5.8E10 -31.5 28.0 79.0 8.0 2.3 44.4 883 97 12.3 46.5 90.5 10.2 33.4 53.4 93.6 11.1 80 48.7 90.9 100 
2.0E8 -0.1 1.2E11 -30.2 32.8 84.7 11.6 -7.7 40.4 90.6 133 0.142.8°91.7 13.9 19:7 494° 94.3215,2 73.8 491) 90.1 441 
2.0E8 -0.1 1.8E11 -23.3 30.0 86.3 14.9 -6.3 36.7 90.3 16.4 -1.0 38.1 91.2 16.9 19.6 43.8 946 186 1.4 45.2 90.7 17.1 
2.0E8 0.0 3.2E11 -20.2 35.6 90.2 20.2 -0.5 47.1 93.4 226 5.6 49.7 93.3 23.3 27.9 59.8 95.3 25.6 -0.4 79.7 91.2 226 
2.1E8 0.0 4.4E11 -21.2 40.5 88.9 23.6 -7.9 52.3 90.7 255 -1.6 55.1 92.0 26.3 18.0 64.4 94.2 288 -5.1 64.2 90.9 258 
2.1E8 0.0 3.4E11 -10.4 46.3 92.1 22.1 -5.9 48.9 92.2 226 -1.0 50.7 93.0 23.2 20.8 59.9 94.7 25.6 -3.3 65.7 90.3 229 
2.1E8 0.0 2.3E11 -7.0 40.8 93.0 18.5 -2.2 42.8 93.2 19.0 4.2°44,7 94.1 419:6"'27-2053,2" 95 8221.6 “29:7 490) 90.9184 


Estimation Variance estimation 


of Change 1 e = 
8 V0 Y, va V, Via V2 


Month Change* RB Var* RB CV CP MW RB CV cP MW RB CV CP MW RB CV "cP MW RB CV CP MW 
2 5.0E6 0.1 8.8E10 -38.8 25.9 89.0 93 -9.7 35.1 92 AE V3 39-2. 2 Bi 2093, 7 Mil Tl Gas 43 OR Ob 12.8 6.7 43.8 93.6 12.3 
3 3.8E6 -1.0 1.1E11 -36.5 25.2 88.4 10.6 -12.6 34.5 91.9 12.4 -6.7 36.0 92.4 12.8 10.4 41.2 93.9 13.9 * 014 #43" 93.2'513.3 
4 1.0E6 11.0 2.1E11 -31.2 45.6 87.3 15.2 -5.0 59.3 90.9 17.9 0.6 62.4 91.6 18.4 21.6 75.2 93.9 20.2 -3.4 98.8 91.5 18.0 
5 2.1E6 -0.5 2.2E11 -41.6 39.9 85.6 14.3 -14.3 63.9 91.1 17.4 -8.4 66.6 90.1 18.0 10.5 76.0 94.9 19.7 132952193291 8:9 
6 “PAES °=7-8. VOEVY 4017395 82.57 13.5019. 9 47.5 280.5 16.3°-°-6:5:5503-490:716:9" 12-7" 60.199440418:5 9: Sree O TE 11616 
7 2.5ES -7.2 2.1E11 -39.0 48.4 82.9 143 -15.1 

Total: population total. 

Change: population difference between the current month and the previous month. 

Var: variance of the estimated total or change. 

RB: relative bias = 100(bias/true value)%. 

CV: coefficient of variation = 100 (standard error/true value)%. 

CP: coverage probability of asymptotic confidence interval using estimated variance (in %). 

MW: (mean width of asymptotic confidence interval) / 10^5.

*: Scientific notation (for example, 6,700,000 is 6.7E6). 


HA A SD Ww 


60.3 89.5 16.9 -10.6 62.4 903 173 8.0 72.3 94.0 19.0 -3.9 “82.0 91.8°18.0 
Table 5
Simulation Results for Weekly Pay


Estimation 
of Total 

Month Total* RB Var* 

1 2.0E9 -0.1 9.5E12 
2Z1ESP-ON LIES 
2.1E9 -0.1. 2.2E13 
2.2E9 -0.1 3.7E13 
2.2E9 -0.1 5.0E13 
2.2E9 -0.1 4.5E13 
2.2E9 -0.1 3.5E13 


NYDN FW WN 


Estimation 

of Change 
Month Change* RB Var* 
6.4E7 -0.1 1.5E13 
35E7 1-6 1-3E13 
DIET OG 24513 
2.1E7 -0.4 2.4E13 
1.4E7 «62.0 -2.3E13 
7 Eee OM ETS 


Nn PWN 


Yio 


RB CV CP MW 


-30.7 
-27.2 
-14.3 
-12.3 
-16.0 

-9.4 

-7.3 


RB 
-37.6 
-31.7 
-29.5 
-40.5 
-40.8 
-40.5 


30.4 81.8 
27.8 84.3 
34.7 85.6 
40.3 90.1 
41.6 89.0 
44.1 92.0 
43.1 92.1 


V0 
CV CP 


25.7 85.4 
Aifee) hd 
47.1 86.7 
34.1 83.5 
31.1 84.4 
42.0 83.1 


10.3 
14.1 
17.4 
22.8 
25.9 
Za 
22.8 


MW 
{2,72 
We) 
16.5 
iS), 11 


RB 
iNe2/ 
-3.4 
1.1 
6.4 
-1.5 
-3.8 
-0.7 


RB 
-8.2 
-5.2 

0.4 
92 


14.8 -13.5 


16.0 


-13.9 


Vi 
GVS Cr 
38.4 93.0 
42.3 92.2 
63.2 91.9 
Spii/ 0s) 
46.0 91.4 
56.5 89.2 


Total: population total. 


MW RB 


12.4 
16.2 
18.9 
Zari 
28.1 
26.3 
23.6 


MW 
14.8 
14.0 
19.6 
18.7 
17.8 
19.3 


Variance estimation 


Via 

17.2 44.3 92.4 13.3 
Uy CNT SNe 7/E 
8.0 43.9 89.5 19.5 
13.8 53.0 94.1 26.0 
954-85 92.0829 
1.8 48.7 92.8 27.1 
6.8 50.0 93.9 24.5 


Variance estimation 
Vi 
RB CV CP MW 
0.2 40.4 94.1 15.5 
2.2 43.8 92.8 14.6 
6.7 66.2 92.6 20.2 
-2.4 58.9 92.0 19.4 
-6.7 48.9 92.1 18.5 


Sh Sd SAO NR! 


39.8 
SHI 
34.9 
41.2 
295 
27.8 
SR 


RB 
21.6 
223 
30.7 
19.9 
16.8 
13.0 


v 
CNET CR MWe RB CVee Chari. 


54.4 
48.1 
51.4 
63.0 
64.6 
57.8 
57.0 


tl 


94.4 
935) 
93 
96.1 
94.3 
95.0 
96.4 


14.6 
18.9 
ZAES 
28.9 
By 
30.3 
272 


MW 
vel 
1529 
22.4 
215 
20.7 
22.1 


RB 
4.3 
333 
2.6 

-0.9 

-5.4 

-0.4 

-0.0 


RB 
a5 
3:5 

-4.3 
3.6 

-4.4 

-3.7 


Veh? 
CVanCr 
48.9 91.0 
S159 1-6 
50.4 91.4 
84.5 92.8 
56.0 92.4 
54.1 94.2 
54.3 95.3 


Yap 2 
CV EP 
49.2 92.6 
43.2 93.5 
96.9 90.6 
90:0" 92.5 
53/08 ONES 
69.5 90.8 


MW 
12.6 
16.8 
19.0 
24.2 
27.5 
26.8 
2H 


MW 
ley) 
14.7 
RZ 
199 
18.8 
20.4 


Change: population difference between the current month and the previous month. 


Var: variance of the estimated total or change. 
RB: relative bias = 100(bias/true value)%. 
CV: coefficient of variation = 100 (standard error/true value)%. 


CP: coverage probability of asymptotic confidence interval using estimated variance (in %). 


MW: (mean width of asymptotic confidence interval) / 10^6.
*: Scientific notation (for example, 6,700,000 is 6.7E6). 
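The four summary measures in the table notes can be computed from simulation output along the following lines. This is a minimal sketch, not the authors' code: the function name, the argument names and the use of the normal quantile 1.96 are assumptions, and CV is interpreted here as the simulation standard deviation of the variance estimates relative to the true variance.

```python
import math
import statistics


def summarize_variance_estimator(estimates, var_estimates, true_value, true_var, z=1.96):
    """Summarise one variance estimator over simulation runs (hypothetical helper).

    estimates     : point estimates of the total (or change), one per run
    var_estimates : the corresponding variance estimates, one per run
    true_value    : the population total (or change)
    true_var      : the true variance of the point estimator
    """
    mean_v = statistics.fmean(var_estimates)
    # RB: relative bias = 100 (bias / true value) %
    rb = 100.0 * (mean_v - true_var) / true_var
    # CV: 100 (standard error / true value) %
    cv = 100.0 * statistics.pstdev(var_estimates) / true_var
    # Asymptotic confidence interval: estimate +/- z * sqrt(variance estimate)
    half_widths = [z * math.sqrt(v) for v in var_estimates]
    hits = [abs(est - true_value) <= hw for est, hw in zip(estimates, half_widths)]
    cp = 100.0 * sum(hits) / len(hits)                      # CP in %
    mw = statistics.fmean([2.0 * hw for hw in half_widths])  # MW before rescaling
    return {"RB": rb, "CV": cv, "CP": cp, "MW": mw}


summary = summarize_variance_estimator(
    estimates=[100.0, 104.0, 96.0, 110.0],
    var_estimates=[4.0, 4.0, 4.0, 4.0],
    true_value=100.0,
    true_var=5.0,
)
```

In the toy call every variance estimate is 4 against a true variance of 5, so the relative bias is negative and only the run whose point estimate equals the true value is covered.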


From Tables 2 through 5, the relative biases of estimators of monthly totals and changes are
negligible for all variables. The following is a summary of the simulation results for the
variance estimators in terms of RB and CV.

1. As expected, the naive variance estimator $v_{t0}$ has a large negative relative bias.

2. The asymptotically unbiased variance estimator $v_{t1} - v_{t2}$ performs well in general.
Its relative bias is always under 10% in absolute value and is frequently under 5%.

3. The variance estimator $v_{t1}$ has a large positive relative bias in all cases. This
indicates that the $v_{t2}$ term is not negligible in the CES, in which the overall sampling
fraction, n/N, is about 15%.

4. The variance estimator that is the same as $v_{t1}$ but with the sampling fractions
$n_h/N_h$ incorporated (section 4) has a negative relative bias in general. Its negative bias
may be large, especially in the estimation of the variance for monthly changes.

5. The variance estimator that is the same as $v_{t1}$ but with the sampling fractions
$n_h/N_h$ replaced by the observed (respondent-based) sampling fractions performs well in the
simulation study, although it is not asymptotically unbiased (section 4). Its relative bias is
large in a few cases, e.g., in variance estimation for the total of weekly pay at months 1 and
4, in variance estimation for the total of hours at month 1, and in variance estimation for the
change of employment at month 7. In many cases, however, its performance is even better than
that of the asymptotically unbiased estimator $v_{t1} - v_{t2}$.


The following is a summary of the simulation results for the confidence intervals in terms of
CP and MW.

1. The CP of the confidence interval based on the naive variance estimator $v_{t0}$ is
substantially lower than the nominal level 95% in most cases.

2. The CP of the confidence interval based on the asymptotically valid variance estimator,
$v_{t1} - v_{t2}$, is between 90% and 93% in most cases. This is often the case for an
asymptotically valid variance estimator, i.e., its relative bias is small but the CP of the
related confidence interval is lower than the nominal level. One possible reason is that the
convergence in distribution (asymptotic normality, which is the key for asymptotic confidence
intervals) requires a larger sample size than the convergence of the second moment (in variance
estimation).

3. In terms of CP, the confidence interval based on $v_{t1}$ is the best. This might be because
the overestimation in variance offsets the undercoverage in interval estimation. The mean width
of the interval based on $v_{t1}$ may

94 Shao and Butani: Variance Estimation for the Current Employment Survey 


be substantially larger than those of other intervals, especially for weekly pay.

4. The CP of the confidence interval based on the estimator that uses the observed sampling
fractions, which is not asymptotically valid, is similar to that of the confidence interval
based on $v_{t1} - v_{t2}$.


6. CONCLUSION AND DISCUSSION 


For the survey estimators in the Current Employment Survey (CES) with imputed data, we propose
an asymptotically unbiased and consistent estimator $v_{t1} - v_{t2}$ (section 3). Although
$v_{t1}$ can be easily computed using the grouped balanced half sample method, the computation
of $v_{t2}$ involves separate derivations for nonlinear estimators. Thus, several
approximations, $v_{t1}$ and two variants of it (section 4), are considered and compared with
$v_{t1} - v_{t2}$ in a simulation study in which a CES dataset is used as the population. Our
results show that $v_{t1}$ and its variant with the sampling fractions $n_h/N_h$ incorporated
have large relative biases, due to the fact that the overall sampling fraction, 15%, is not
negligible; the estimator that is the same as $v_{t1}$ but incorporates an estimated sampling
fraction (using the rate of response) in the balanced half sample method performs fairly well.
Thus, this estimator is recommended to replace $v_{t1} - v_{t2}$ if the computation of $v_{t2}$
is too complicated. Since the use of the “observed sampling fraction” does not reflect the fact
that information is available about the nonrespondents from previous months, this estimator may
be improved using a more accurate estimated sampling fraction, for example, Rubin’s (1987)
“fraction of missing information”.

Although our study is based on the CES, our results are 
applicable to any survey that adopts a similar sampling 
design and a similar imputation method. Furthermore, an 
extension to the case where model (2) involves 
Ve io Vent 72 Voip with an integer s>2 is straightforward, 
although the derivation of V,. (for an asymptotically valid 
variance estimator) is more complicated. 
APPENDIX: PROOF OF (4)


It suffices to show that 
aI de A aA (9) 


We show the case of a single imputation cell and ype ay 
(employment). The general case can be treated similarly. 


We use mathematical induction. When t = le 


By assumption (2), 
Cov,, (Y,, ¥,) = a; V,,(Yo) + 0? E,, (Yp) 


= N (avy +67 Hy) 


Ve). 


Suppose now that (9) is true at time t-1. Let E,, V, and 
Cov, be the expectation, variance and covariance condi- 
tional on Yi is R,,j=1, wat aLnen 


EA Yea saa, 
and 


Cov ,(Y,, Lies Cov, (4, I ace es) 


= ¥,, Cov, (G,, Y,) 


RIG IS 
Pent OR 


where the last equality follows from assumption (2). By the 
induction assumption, 


Sea ana esha Por ah 
Then 


Cov,, (¥,, ¥,) = Cov, [E,(Y), E,(¥)] +E, Cov,(¥,,¥)] 
= 4; Cov,, (5 ¥,) + 0°, (F,.1) 
=o; V,,(¥,:) +07, (¥,.1) 


=, V_(¥,). 
Implementing Rao-Shao Type Variance Estimation with
Replicate Weights 


MICHAEL P. COHEN¹


ABSTRACT 


In estimating variances so as to account for imputation for item nonresponse, Rao and Shao (1992) originated an approach
based on adjusted replication. Further developments (particularly the extension to balanced repeated replication from the
jackknife replication of Rao and Shao) were made by Shao, Chen and Chen (1998). In this article we explore how these
methods can be implemented using replicate weights.


KEY WORDS: Balanced Repeated Replication; Jackknife replication; Imputation; Item nonresponse; Weighted hot deck. 


1. INTRODUCTION 


Variance estimation by replication methods is facilitated 
by the use of replicate weights (Dippo, Fay and 
Morganstein 1984). In the past decade adjusted replication 
methods have been developed (Rao and Shao 1992; Shao, 
Chen and Chen 1998) that allow one to account for the 
variation due to imputation for item nonresponse in the 
estimation of variances. It is not, however, entirely obvious
how these adjusted replication procedures can be implemented
by means of replicate weights. This article explores
how this can be done. The focus is on ways to prepare the 
dataset so that standard variance estimation software 
products that make use of replicate weights will work 
without modification. In the next to last section, however, 
some comments are made about whether modifying the 
software would help. 


2. REPLICATION METHODS AND REPLICATE 
WEIGHTS 


Wolter (1985) provides a comprehensive introduction to 
variance estimation for sample surveys. Chapters 3 and 4 
cover the two replication methods pertinent to this article: 
the jackknife and balanced repeated replication. Shao and 
Tu (1995, chapter 6) is recommended for a more recent and 
advanced treatment. Variance estimation for surveys by 
replication continues to be an active area for research. 
Works that are even more recent include Brick and 
Morganstein (1996, 1997), Kott (2001), Rao and Shao 
(1996, 1999), Rust and Rao (1996), Shao (1996) and 
Valliant (1996). 

The two replication methods work by creating subsets of the sample called replicates. The
methods differ in the pattern by which replicates are formed. In balanced repeated replication
(also called balanced half-sample replication), the replicates consist of roughly half the
units in the original sample; hence they are also called half samples. In jackknife replication
(as applied to survey data), the replicates typically consist of the original sample except
that a single primary sampling unit (PSU) or a small number of PSUs in the same stratum is
deleted. For both methods, the replicates can be considered samples in their own right.
Therefore if $\hat{\theta}$ is an estimate of some quantity $\theta$ based on the original
sample, we can form an estimate $\hat{\theta}^{(r)}$ of $\theta$ based on replicate $r$. If
there are $R$ replicates, we estimate the sampling variance of $\hat{\theta}$,
$\mathrm{var}(\hat{\theta})$, by

$$\mathrm{var}(\hat{\theta}) = C_{M,R} \sum_{r=1}^{R} \left( \hat{\theta}^{(r)} - \hat{\theta} \right)^2 \qquad (2.1)$$

where the constant $C_{M,R}$ depends solely on the replication method $M$ and the number of
replicates $R$.

In forming the estimate $\hat{\theta}$ of $\theta$, use is made of the sample weights. For
example, to estimate a population total for a particular item $y$, the estimate is the weighted
sum of the values of $y$. Thus, if $y_u$ and $w_u$ are the values of $y$ and the sample weight
for sample unit $u$, then $\hat{\theta} = \sum w_u y_u$ where the sum is over all units in the
sample. In addition to the sample weight $w_u$ on the record for unit $u$, we can add replicate
weights $w_u^{(r)}$, $r = 1$ to $R$, to the record on the file and calculate
$\hat{\theta}^{(r)}$ in the same way as $\hat{\theta}$ except that $w_u^{(r)}$ replaces $w_u$
for each sample unit $u$. Thus for the example in which $\hat{\theta}$ is the population total
for $y$, $\hat{\theta}^{(r)} = \sum w_u^{(r)} y_u$. If unit $u$ is not in replicate $r$, then
$w_u^{(r)} = 0$. Some or all of the replicate weights for units that are in the replicate will
be larger than their sample weights so that the units in the replicate continue to represent
the entire population.
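As an illustration of (2.1), the sketch below computes a replicate-weight variance estimate for a weighted total. The function names and the toy weights are assumptions made for this example; the constant is passed in by the caller, and the value 1/R used in the call is the usual choice for balanced repeated replication.

```python
def weighted_total(weights, values):
    """Weighted sum of the item values: sum over units of w_u * y_u."""
    return sum(w * y for w, y in zip(weights, values))


def replicate_variance(full_weights, replicate_weights, values, constant,
                       estimator=weighted_total):
    """Variance estimate of the form (2.1).

    replicate_weights : list of R weight vectors, one per replicate
    constant          : the constant C_{M,R} for the replication method in use
    estimator         : any statistic computed from (weights, values)
    """
    theta_hat = estimator(full_weights, values)
    squared_devs = [(estimator(w_r, values) - theta_hat) ** 2
                    for w_r in replicate_weights]
    return constant * sum(squared_devs)


# Toy half-sample layout (hypothetical numbers): two strata with two units each.
# Units outside a replicate get weight 0; units inside have their weight doubled.
values = [1.0, 2.0, 3.0, 4.0]
full_weights = [10.0, 10.0, 10.0, 10.0]
replicate_weights = [
    [20.0, 0.0, 20.0, 0.0],
    [20.0, 0.0, 0.0, 20.0],
    [0.0, 20.0, 20.0, 0.0],
    [0.0, 20.0, 0.0, 20.0],
]
variance = replicate_variance(full_weights, replicate_weights, values,
                              constant=1.0 / len(replicate_weights))
```

Because the statistic is passed in as a function, the same loop serves for any estimator that can be recomputed with a different weight vector.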

The use of replicate weights provided on the file to 
calculate the sampling variance estimates has advantages: 


- Any statistic, no matter how complicated, that can be calculated for the whole sample can be
calculated just as easily for each replicate. The sampling variance is then estimated by (2.1).

- Adjustments for unit nonresponse and poststratification can (and should) be done individually
for each replicate and incorporated in the replicate weights. This adjustment is usually done
by an experienced sampling statistician and the adjusted replicate weights are put on the file
so that the data analyst can use them without extra effort.

- Adjustments to the replicate weights put on the file can make use of auxiliary information
not available to the data analyst, possibly for reasons of confidentiality. Even if not
restricted, the auxiliary information may be difficult for the data analyst to obtain or use.

- General purpose software is available that employs replicate weights. Two software products
that emphasize replication methods for surveys are WesVar from Westat, Inc. and VPLX from the
U.S. Census Bureau. See the Web page

//www.fas.harvard.edu/~stats/survey-soft/survey-soft.html

for information on survey analysis software.


In this section we have ignored the complications that 
come from trying to capture the component of variance due 
to item imputation in the variance estimates. We begin to 
address these complications in the next section. 


3. ADJUSTED REPLICATION METHODS 


The works of Rao and Shao (1992) and Shao, Chen and 
Chen (1998) are key to this article. Shao and Chen (1999) 
and Shao and Steel (1999) also treat replication-based 
variance estimation for imputed survey data. 

We begin by developing the notation, for the most part using that of Shao, Chen and Chen
(1998). The population is divided into $L$ strata with $N_h$ clusters in the $h$th stratum. In
the first stage of sampling in stratum $h$, $n_h \geq 2$ clusters are selected, the $i$th
cluster being selected with probability $p_{hi}$, $i = 1, \ldots, N_h$; $h = 1, \ldots, L$. The
clusters are selected without replacement and clusters in different strata are selected
independently. The sampling fractions $n_h/N_h$ are assumed to be small enough that no finite
population correction is needed. Further stages of sampling may take place within each cluster,
independently from cluster to cluster. There are $N_{hi}$ ultimate population units in cluster
$i$ of stratum $h$. For population unit $(h, i, j)$, there is a variable $y_{hij}$ of interest.
Let $S$ be the collection of all sample units and let $\{\tilde{y}_{hij}, (h, i, j) \in S\}$ be
the imputed dataset: the $\tilde{y}_{hij}$ are equal to $y_{hij}$ when the item is observed and
equal to the imputed value otherwise. The sample units are divided into imputation classes
indexed by $k$ and $A_k$ is the index set of respondents for item $y$ in imputation class $k$.
We assume that the dataset contains identifiers (“flags”) so that the nonrespondents can be
identified.


In adjusted replication methods, $\tilde{y}_{hij}$ in imputation class $k$ is adjusted to

$$\tilde{y}_{hij}^{(r)} = \begin{cases} \tilde{y}_{hij} + E_k^{(r)}(\tilde{y}_{hij}) - E_k(\tilde{y}_{hij}) & \text{if } y_{hij} \text{ is imputed} \\ \tilde{y}_{hij} & \text{if } y_{hij} \text{ is observed,} \end{cases} \qquad (3.1)$$

where $E_k$ is the expectation with respect to the original imputation procedure within
imputation class $k$ and $E_k^{(r)}$ is the expectation with respect to the imputation
procedure based only on data in the $r$th replicate within imputation class $k$. This formula
is given explicitly in Shao, Chen and Chen (1998, page 822) for balanced repeated replication
and a variety of imputation methods. It also applies to the development in Rao and Shao (1992)
for jackknife replication and weighted hot deck imputation.
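To make (3.1) concrete, the following sketch applies the adjustment within a single imputation class under weighted hot deck imputation, where both expectations are weighted means of the respondents' values. The function names are hypothetical. The six units use weights and values as read from Table 1 of this article (with the value for unit 004 taken as 5.1, the reading that reproduces the published totals), so treat the numbers as illustrative.

```python
def weighted_mean(weights, values):
    """Weighted mean of the donor values (the hot deck expectation)."""
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


def adjusted_values(y_tilde, imputed, full_weights, rep_weights):
    """Apply adjustment (3.1) for one replicate and one imputation class.

    Imputed values are shifted by E_k^(r) - E_k; observed values are unchanged.
    """
    donors = [i for i, flag in enumerate(imputed) if not flag]
    donor_y = [y_tilde[i] for i in donors]
    e_full = weighted_mean([full_weights[i] for i in donors], donor_y)
    e_rep = weighted_mean([rep_weights[i] for i in donors], donor_y)
    return [y + e_rep - e_full if flag else y
            for y, flag in zip(y_tilde, imputed)]


# Units 001..006 of the numerical illustration (values as read from Table 1).
y_tilde = [5.4, 5.1, 5.2, 5.1, 5.1, 5.4]
imputed = [True, False, False, True, True, False]
full_weights = [10.1, 20.3, 18.4, 11.1, 16.3, 15.4]
first_replicate = [20.2, 40.6, 36.8, 0.0, 0.0, 0.0]

y_adjusted = adjusted_values(y_tilde, imputed, full_weights, first_replicate)
replicate_total = sum(w * y for w, y in zip(first_replicate, y_adjusted))
```

The resulting replicate total is approximately 506.048, the first-replicate figure quoted in the discussion of Table 1.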

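We shall adopt the notation that $(h^{\circ} i^{\circ} j^{\circ})$ denotes a unit that did not
respond to item $y$ and $(h' i' j')$ denotes a unit that did respond to item $y$. We assume
that

$$E_k(\tilde{y}_{h^{\circ} i^{\circ} j^{\circ}}) = \sum_{(h'i'j') \in A_k} a_{h'i'j';\, h^{\circ} i^{\circ} j^{\circ}} \, y_{h'i'j'}$$

and

$$E_k^{(r)}(\tilde{y}_{h^{\circ} i^{\circ} j^{\circ}}) = \sum_{(h'i'j') \in A_k} a^{(r)}_{h'i'j';\, h^{\circ} i^{\circ} j^{\circ}} \, y_{h'i'j'}$$

where the $a_{h'i'j';\, h^{\circ} i^{\circ} j^{\circ}}$ and
$a^{(r)}_{h'i'j';\, h^{\circ} i^{\circ} j^{\circ}}$ are constants not depending on the values
of the $y_{h'i'j'}$, and $a^{(r)}_{h'i'j';\, h^{\circ} i^{\circ} j^{\circ}} = 0$ for
$(h'i'j')$ not in replicate $r$. The $a_{h'i'j';\, h^{\circ} i^{\circ} j^{\circ}}$ and
$a^{(r)}_{h'i'j';\, h^{\circ} i^{\circ} j^{\circ}}$ may depend on auxiliary information
available for all units in the sample. For the weighted hot deck of Rao and Shao (1992) and all
of the imputation methods of Shao, Chen and Chen (1998), the expectations have this form.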


3.1 Example: Ratio Imputation 


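This imputation method applies to situations in which there are auxiliary data $\{x_{hij}\}$
available for all sample units. Ratio imputation imputes a missing item
$\tilde{y}_{h^{\circ} i^{\circ} j^{\circ}}$ by

$$x_{h^{\circ} i^{\circ} j^{\circ}} \sum_{(h'i'j') \in A_k} w_{h'i'j'} \, y_{h'i'j'} \Big/ \sum_{(h'i'j') \in A_k} w_{h'i'j'} \, x_{h'i'j'}.$$

So

$$a_{h'i'j';\, h^{\circ} i^{\circ} j^{\circ}} = x_{h^{\circ} i^{\circ} j^{\circ}} \, w_{h'i'j'} \Big/ \sum_{(h''i''j'') \in A_k} w_{h''i''j''} \, x_{h''i''j''}$$

and

$$a^{(r)}_{h'i'j';\, h^{\circ} i^{\circ} j^{\circ}} = x_{h^{\circ} i^{\circ} j^{\circ}} \, w^{(r)}_{h'i'j'} \Big/ \sum_{(h''i''j'') \in A_k} w^{(r)}_{h''i''j''} \, x_{h''i''j''}.$$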

Notice that the $a_{h'i'j';\, h^{\circ} i^{\circ} j^{\circ}}$ and
$a^{(r)}_{h'i'j';\, h^{\circ} i^{\circ} j^{\circ}}$ depend on the $x_{hij}$.
3.2 Example: Weighted Hot Deck Imputation

This imputation method imputes a missing item by a value randomly selected from the respondents
to the same item with probability proportional to the weights of the respondents in the
imputation class. See section 5 for further discussion of this method. Shao, Chen and Chen
(1998, page 822) show that

$$E_k(\tilde{y}_{h^{\circ} i^{\circ} j^{\circ}}) = \sum_{(h'i'j') \in A_k} w_{h'i'j'} \, y_{h'i'j'} \Big/ \sum_{(h'i'j') \in A_k} w_{h'i'j'}$$

and

$$E_k^{(r)}(\tilde{y}_{h^{\circ} i^{\circ} j^{\circ}}) = \sum_{(h'i'j') \in A_k} w^{(r)}_{h'i'j'} \, y_{h'i'j'} \Big/ \sum_{(h'i'j') \in A_k} w^{(r)}_{h'i'j'}.$$

Thus

$$a_{h'i'j';\, h^{\circ} i^{\circ} j^{\circ}} = w_{h'i'j'} \Big/ \sum_{(h''i''j'') \in A_k} w_{h''i''j''}$$

and

$$a^{(r)}_{h'i'j';\, h^{\circ} i^{\circ} j^{\circ}} = w^{(r)}_{h'i'j'} \Big/ \sum_{(h''i''j'') \in A_k} w^{(r)}_{h''i''j''}.$$

4. THE DATA FILE FOR VARIANCE 
ESTIMATION 


For simplicity we assume that each record contains an identifier indicating to which imputation
class the unit belongs. Often the imputation class is determined by several variables on the
record. A record will look something like this:

ID   IC   $w_{hij}$   $w_{hij}^{(1)}$ ... $w_{hij}^{(R)}$   $\tilde{y}_{hij}$   $IF_y$   $\tilde{z}_{hij}$   $IF_z$

where ID is the identifier for the unit, IC is the identifier for the imputation class,
$w_{hij}$ is the (full sample) weight, $w_{hij}^{(1)}, \ldots, w_{hij}^{(R)}$ are the replicate
weights, $\tilde{y}_{hij}$ is the value (possibly imputed) of the variable $y$ under
consideration, $IF_y$ is the imputation “flag” that indicates whether $\tilde{y}_{hij}$ is
imputed, $\tilde{z}_{hij}$ is the value (possibly imputed) of another variable $z$ and $IF_z$
is the imputation “flag” that indicates whether $\tilde{z}_{hij}$ is imputed. There, of course,
may be other variables on the files as well, for example an auxiliary variable $x_{hij}$
available for all sample units.

We propose to add additional records, called extra records, to facilitate variance estimation.
For each nonrespondent $(h^{\circ} i^{\circ} j^{\circ})$ and respondent $(h' i' j')$ to item
$y$ in imputation class $k$, we create the record

ID   IC   0   $\tilde{w}^{(1)}_{h^{\circ} i^{\circ} j^{\circ};\, h'i'j'}$ ... $\tilde{w}^{(R)}_{h^{\circ} i^{\circ} j^{\circ};\, h'i'j'}$   $y_{h'i'j'}$   $IF_y$   0   $IF_z$

where $IC = k$, ID is the identifier of the unit $(h^{\circ} i^{\circ} j^{\circ})$ that did
not respond to item $y$ and

$$\tilde{w}^{(r)}_{h^{\circ} i^{\circ} j^{\circ};\, h'i'j'} = \left( a^{(r)}_{h'i'j';\, h^{\circ} i^{\circ} j^{\circ}} - a_{h'i'j';\, h^{\circ} i^{\circ} j^{\circ}} \right) w^{(r)}_{h^{\circ} i^{\circ} j^{\circ}}, \quad r = 1, \ldots, R. \qquad (4.1)$$

Note that the full sample weight is 0 on the extra records so these records do not affect the
full sample estimates. The replicate estimates, though, agree with those defined by (3.1). Note
also that the weights $\tilde{w}^{(r)}_{h^{\circ} i^{\circ} j^{\circ};\, h'i'j'}$ may be
negative.
Table 1 
Numerical Illustration: Portion of Data File for 
Variance Estimation 

ID BIG pe Wee MWe tis OnWinn Pm Jip Ey? top | Ee 
001 1 10.1 20.2000 0.0000 5.4 hace hed 1 
002 1 20.3 40.6000 0.0000 5.1 Mavic sry 0 
003 1 18.4 36.8000 0.0000 5.2 On? 3ss..0 
004) 107K I 0.0000 2222000 “5:4 PICTZE V0 
005 1 16.3 0.0000 32.6000 5.1 atl ae A 
006 1 15.4 0.0000 30:80007 S Al oh" ON el4 50 
OOT ET 20.055 3:0162 0.0000 5.1 Zz OO 3 
OOle Fis - 0.0 2.7339 0.0000 5.2 Lie Oe wees 
DOR MAME D.08S.7501 0.0000 5.4 DE OO 3 
004 1 40.0 0.0000 -8.3301 5.1 2 AGO 3 
004 1 0.0 0.0000 BS feo 94 Oe pe 55) es oO ene 
004 1 40.0 0.0000 15.8806 5.4 215110.0; 3 
OOS IT OCOMe WOOO0R +31212:230555"1 2S 3 
005 1 O.0 0.0000 -- -11.0876 5.2 PUNO. 5 
O05 Lae (OOF 70 G008 De SLOSS BZ OOF | 33 
OOL 1 200" 25.0645 0.0000 0.0 esate Ih: 2 
OOPYA., 470:010N5;0436 0.0000 0.0 SUS 2) 
OOL 10:0" 2.7512 0.0000 0.0 sree? yg 
001 1 0.0. -4.0400 0.0000 0.0 Siewrisa) 2 
OOL . sleanG:0" 425.8169 0.0000 0.0 Sal WAY. 2 


Table 1 provides a numerical illustration. In the illustration, the nine records (rows of the
table) with $IF_y = 2$ are the extra records for item $y$. The first six records are the
original records for the six sample units that constitute imputation class $IC = 1$. (The
records at the end with $IF_z = 2$ are the extra records for item $z$ and will be discussed in
the next paragraph. In these records, the imputation flag for $y$, $IF_y$, has been set to 3 to
indicate that these are extra records for an item other than $y$.) There are three respondents
($IF_y = 0$) and three nonrespondents ($IF_y = 1$) to item $y$. The method of imputation is
assumed to be weighted hot deck. Only the first and last replicate weights ($w_{hij}^{(1)}$ and
$w_{hij}^{(R)}$) are presented, but these are consistent with replicate weights used for the
balanced repeated replication method of variance estimation. We have
$\sum w_{hij} \tilde{y}_{hij} = 476.650$, $\sum w_{hij}^{(1)} \tilde{y}_{hij} = 506.048$ and
$\sum w_{hij}^{(R)} \tilde{y}_{hij} = 455.696$ where the sums are over all the records. The
reader may verify that this agrees with $\sum w_{hij} \tilde{y}_{hij} = 476.650$,
$\sum w_{hij}^{(1)} \tilde{y}_{hij}^{(1)} = 506.048$ and
$\sum w_{hij}^{(R)} \tilde{y}_{hij}^{(R)} = 455.696$ obtained using (3.1) where the sums are
over the first six records only.
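The check described above can be sketched in code. The example builds the extra records from (4.1) for weighted hot deck imputation and then sums over all records with an ordinary weighted total, as standard replicate-weight software would. The record layout (tuples) and function names are assumptions; the weights and values are read from Table 1, with the value for unit 004 taken as 5.1 because that reading reproduces the published totals.

```python
# (id, full weight, [first replicate weight, last replicate weight], y value, imputed?)
UNITS = [
    ("001", 10.1, [20.2, 0.0], 5.4, True),
    ("002", 20.3, [40.6, 0.0], 5.1, False),
    ("003", 18.4, [36.8, 0.0], 5.2, False),
    ("004", 11.1, [0.0, 22.2], 5.1, True),
    ("005", 16.3, [0.0, 32.6], 5.1, True),
    ("006", 15.4, [0.0, 30.8], 5.4, False),
]


def hot_deck_coefficients(donor_weights):
    total = sum(donor_weights)
    return [w / total for w in donor_weights]


def extra_records(units):
    """One extra record per (nonrespondent, donor) pair, with weights from (4.1)."""
    donors = [u for u in units if not u[4]]
    n_reps = len(units[0][2])
    a_full = hot_deck_coefficients([d[1] for d in donors])
    a_rep = [hot_deck_coefficients([d[2][r] for d in donors]) for r in range(n_reps)]
    records = []
    for uid, _, rep_w, _, is_imputed in units:
        if not is_imputed:
            continue
        for k, donor in enumerate(donors):
            w_tilde = [(a_rep[r][k] - a_full[k]) * rep_w[r] for r in range(n_reps)]
            # Full sample weight 0, value taken from the donor.
            records.append((uid, 0.0, w_tilde, donor[3], False))
    return records


def totals(records):
    """Full sample total and replicate totals, computed as plain weighted sums."""
    n_reps = len(records[0][2])
    full = sum(w * y for _, w, _, y, _ in records)
    reps = [sum(rw[r] * y for _, _, rw, y, _ in records) for r in range(n_reps)]
    return full, reps


extras = extra_records(UNITS)
full_total, replicate_totals = totals(UNITS + extras)
```

With these inputs there are nine extra records, the full sample total is unaffected by them, and the two replicate totals come out close to the published 506.048 and 455.696.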

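Let us now consider item $z$. The extra records for this item have the form

ID   IC   0   $\tilde{w}^{(1)}_{h^{\circ} i^{\circ} j^{\circ};\, h'i'j'}$ ... $\tilde{w}^{(R)}_{h^{\circ} i^{\circ} j^{\circ};\, h'i'j'}$   0   $IF_y$   $z_{h'i'j'}$   $IF_z$

where $\tilde{w}^{(1)}_{h^{\circ} i^{\circ} j^{\circ};\, h'i'j'}, \ldots,
\tilde{w}^{(R)}_{h^{\circ} i^{\circ} j^{\circ};\, h'i'j'}$ are computed by (4.1) but using the
imputation method and response pattern for item $z$. The imputation method for $z$ need not be
the same as the imputation method for $y$ but must be of the form discussed in section 3. In
Table 1 the extra records for item $z$ can be identified by having $IF_z = 2$. We have then
$\sum w_{hij} \tilde{z}_{hij} = 120.130$, $\sum w_{hij}^{(1)} \tilde{z}_{hij} = 124.349$ and
$\sum w_{hij}^{(R)} \tilde{z}_{hij} = 115.400$ where the sums are over all the records. This
agrees with the sums obtained by (3.1).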

Clearly the biggest disadvantage of this approach is the large number of extra records that
have to be added to the file. This disadvantage is less severe when the imputation classes are
small. (There are, however, many factors that go into determining the size of the imputation
classes.) The advantages, on the other hand, include the following:

- The adjusted replicate estimates and variance estimates can be computed with any software
designed to estimate variances by means of replicate weights.

- If there is another variable, say $y'$, with the same pattern of nonresponse and the very
same method of imputation as $y$ (that is, the same $a$ and $a^{(r)}$ values), the computation
of replicate estimates for $y'$ can be accommodated without adding more records.

- One can make estimates over subdomains, even if they cut across imputation class boundaries.

- Suppose the method of imputation is the weighted hot deck. Then one estimates the variance of
a derived variable, say $\log y$ where $y > 0$, by simply adding the derived variable to each
record and computing replicate estimates based on it. (We shall have more to say about the
weighted hot deck in the next section.)


The data analyst may choose to delete the extra records 
from a copy of the data file and use the reduced file to 
check for outliers, formulate hypotheses, etc. When it 
comes time to estimate variances, the extra records would 
be merged back in. 

It should be pointed out that Rao and Shao (1992) 
proposed and evaluated their jackknife variance estimation 
method only for the estimation of totals (or means). One 
must be cautioned against the use of the approach for more 
complex statistics. In the same way, Shao, Chen and Chen 
(1998) proposed their balanced repeated replication 
variance estimation method for functions of totals and for 
quantiles so it should not be used for other statistics. 


5. THE WEIGHTED HOT DECK 


The use of the weighted hot deck method of imputation (e.g., Cox 1980) has a number of
advantages so we devote a separate section to it. Rao and Shao (1992) concentrate on this
imputation method and it is discussed also in Shao, Chen and Chen (1998). Under this method, a
missing item is imputed by a value selected at random from the respondents to that item in the
imputation class. The probability of selection is proportional to $w_{h'i'j'}$, the weight of
the respondent. The respondents that have a positive probability of being selected are called
potential donors; the nonrespondent being imputed is the recipient. If there is more than one
item on the file that will be imputed by the weighted hot deck, simplifications occur if one
uses complete respondents (units who responded to all items) as potential donors and uses only
one donor to impute all items requiring weighted hot deck imputation for a given recipient.
(The donor is selected for each sample unit having any item for which there is item
nonresponse.)

If each unit in an imputation class has the same chance 
of responding to an item, the weighted hot deck yields 
design consistent estimates of means, totals and sample 
quantiles. The imputations, moreover, will be “plausible” in 
the sense of looking like real data. 

An advantageous feature of the weighted hot deck is that it is equivariant under one-to-one
transformations. To explain equivariance, consider a derived variable $d$ where $d = g(y)$ and
$g$ is a one-to-one function. Then, using the weighted hot deck, we impute item $y$ of unit
$(h^{\circ} i^{\circ} j^{\circ})$ that did not respond to the item by
$\tilde{y}_{h^{\circ} i^{\circ} j^{\circ}}$ and use
$g(\tilde{y}_{h^{\circ} i^{\circ} j^{\circ}})$ for $d$. This is equivalent to using the
weighted hot deck to impute $d$ by $\tilde{d}_{h^{\circ} i^{\circ} j^{\circ}}$ and using
$g^{-1}(\tilde{d}_{h^{\circ} i^{\circ} j^{\circ}})$ for $y$. This feature of hot deck
imputation is not shared by many other methods. For example, under mean imputation (in which
the imputed value is the mean of the values for respondents in the imputation class), $g$ would
have to be linear for the equivariance property to hold. The pertinence of this to variance
estimation by adjusted replicate methods is that when hot deck imputation is used, the data
analyst can add $d = g(y)$ to the file and estimate variances for $d$ as well as for $y$.
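The equivariance property can be illustrated with a small sketch. The donor values and weights are hypothetical, the function names are assumptions, and the same seed is used for both draws so that the same donor is selected in each case.

```python
import math
import random


def hot_deck_donor(donor_weights, rng):
    """Index of a donor drawn with probability proportional to its weight."""
    return rng.choices(range(len(donor_weights)), weights=donor_weights, k=1)[0]


def weighted_mean(weights, values):
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


donor_y = [5.1, 5.2, 5.4]
donor_w = [20.3, 18.4, 15.4]
g = math.log                      # a one-to-one transformation, d = g(y)
donor_d = [g(y) for y in donor_y]

# Hot deck: impute y and transform, or impute d directly with the same draw.
y_imputed = donor_y[hot_deck_donor(donor_w, random.Random(2002))]
d_imputed = donor_d[hot_deck_donor(donor_w, random.Random(2002))]
hot_deck_equivariant = g(y_imputed) == d_imputed

# Mean imputation: transforming the imputed y differs from imputing d,
# because g is not linear.
mean_gap = g(weighted_mean(donor_w, donor_y)) - weighted_mean(donor_w, donor_d)
```

The hot deck result agrees exactly because the imputed value is always one of the donors' actual values, while the gap under mean imputation is strictly positive for the logarithm.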

Suppose that the weighted hot deck is employed for several variables on a file and suppose that
only complete respondents are used as potential donors. In this case, even if the patterns of
nonresponse are different for the variables being imputed, the implementation of the adjusted
replication by replicate weights described in the previous section can be carried out with the
same set of extra replicate weights

$$\tilde{w}^{(r)}_{h^{\circ} i^{\circ} j^{\circ};\, h'i'j'} = \left( a^{(r)}_{h'i'j';\, h^{\circ} i^{\circ} j^{\circ}} - a_{h'i'j';\, h^{\circ} i^{\circ} j^{\circ}} \right) w^{(r)}_{h^{\circ} i^{\circ} j^{\circ}}$$

for each variable.
6. ALTERNATIVES


In this section we consider alternative methods including 
one that requires modifying the software. 


6.1 First Alternative 


One way to reduce the number of records is to have extra records of the form

ID'   IC   0   $\tilde{w}^{(1)}_{h'i'j'}$ ... $\tilde{w}^{(R)}_{h'i'j'}$   $y_{h'i'j'}$   $IF_y$   0   $IF_z$

where ID' is the identifier of the potential donor unit $(h'i'j')$ that responded to item $y$,
$B_k$ is the index set of units not responding to item $y$ in imputation class $k$ and

$$\tilde{w}^{(r)}_{h'i'j'} = \sum_{(h^{\circ} i^{\circ} j^{\circ}) \in B_k} \left( a^{(r)}_{h'i'j';\, h^{\circ} i^{\circ} j^{\circ}} - a_{h'i'j';\, h^{\circ} i^{\circ} j^{\circ}} \right) w^{(r)}_{h^{\circ} i^{\circ} j^{\circ}}.$$

Under this setup, for a given item there is only one extra record per potential donor. The
chief disadvantage is that, because of the summation, estimates for subdomains that cut across
imputation classes cannot be computed.
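A sketch of this first alternative, again for weighted hot deck imputation, is given below. The dictionaries and function names are assumptions; the weights and values are read from Table 1 (with the value for unit 004 taken as 5.1, the reading that reproduces the published totals). The weights of the pairwise extra records are summed over the nonrespondents, which leaves one extra record per donor.

```python
FULL_W = {"001": 10.1, "002": 20.3, "003": 18.4, "004": 11.1, "005": 16.3, "006": 15.4}
REP_W = {"001": [20.2, 0.0], "002": [40.6, 0.0], "003": [36.8, 0.0],
         "004": [0.0, 22.2], "005": [0.0, 32.6], "006": [0.0, 30.8]}
Y_TILDE = {"001": 5.4, "002": 5.1, "003": 5.2, "004": 5.1, "005": 5.1, "006": 5.4}
DONORS = ["002", "003", "006"]        # respondents to item y
RECIPIENTS = ["001", "004", "005"]    # nonrespondents to item y (the set B_k)
N_REPS = 2


def shares(weight_of):
    """Hot deck coefficients: each donor's share of the total donor weight."""
    total = sum(weight_of[d] for d in DONORS)
    return {d: weight_of[d] / total for d in DONORS}


def donor_level_weights():
    """Extra replicate weights per donor, summed over all nonrespondents."""
    a_full = shares(FULL_W)
    a_rep = [shares({d: REP_W[d][r] for d in DONORS}) for r in range(N_REPS)]
    return {d: [sum((a_rep[r][d] - a_full[d]) * REP_W[u][r] for u in RECIPIENTS)
                for r in range(N_REPS)]
            for d in DONORS}


def replicate_totals(extra):
    """Replicate totals from the original records plus one extra record per donor."""
    return [sum(REP_W[u][r] * Y_TILDE[u] for u in FULL_W)
            + sum(extra[d][r] * Y_TILDE[d] for d in DONORS)
            for r in range(N_REPS)]


extra = donor_level_weights()
rep_totals = replicate_totals(extra)
```

Here three extra records replace the nine pairwise records, and the replicate totals for the whole class are unchanged. The link between each adjustment and its nonrespondent is lost in the summation, which is why subdomain estimates are no longer available.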


6.2 Second Alternative 


Perhaps the most obvious implementation would be to add the $\tilde{y}_{hij}^{(r)}$ to the
$(hij)$ record and modify software to use the $\tilde{y}_{hij}^{(r)}$ rather than
$\tilde{y}_{hij}$ when computing replicate estimates. The chief drawbacks are (1)
sophisticated reprogramming of software would be needed, (2) if multiple variables may require
imputation, the number of fields needed expands greatly and (3) it is unclear how a data
analyst would estimate the variance of a derived variable, say $d$, unless the
$\tilde{d}_{hij}^{(r)}$ were put on the file in advance. The favorable features of this
implementation are (1) no extra records are needed and (2) variance estimates for subdomains do
not require additional work.


7. CONCLUDING REMARKS 


The adjusted replication methods of Rao and Shao 
(1992) and Shao, Chen and Chen (1998) provide a way of 
computing variance estimates that account for imputation 
for item nonresponse. An important next step is the 
development of ways to facilitate the computation. This 
article explored implementations based on the use of 
replicate weights. 
Variance Estimation for the General Regression Estimator


RICHARD VALLIANT¹


ABSTRACT 


A variety of estimators of the variance of the general regression (GREG) estimator of a mean have been proposed in the 
sampling literature, mainly with the goal of estimating the design-based variance. Estimators can be easily constructed that, 
under certain conditions, are approximately unbiased for both the design-variance and the model-variance. Several 
dual-purpose estimators are studied here in single-stage sampling. These choices are robust estimators of a model-variance 
even if the model that motivates the GREG has an incorrect variance parameter. 


A key feature of the robust estimators is the adjustment of squared residuals by factors analogous to the leverages used in 
standard regression analysis. We also show that the delete-one jackknife implicitly includes the leverage adjustments and 
is a good choice from either the design-based or model-based perspective. In a set of simulations, these variance estimators 
have small bias and produce confidence intervals with near-nominal coverage rates for several sampling methods, sample 
sizes, and populations in single-stage sampling. 

We also present simulation results for a skewed population where all variance estimators perform poorly. Samples that do
not adequately represent the units with large values lead to estimated means that are too small, variance estimates that are
too small, and confidence intervals that cover at far less than the nominal rate. These defects need to be avoided at the design
stage by selecting samples that cover the extreme units well. However, in populations with inadequate design information
this will not be feasible.


KEY WORDS: Confidence interval coverage; Hat matrix; Jackknife; Leverage; Model unbiased; Skewness. 


1. INTRODUCTION 


Robust variance estimation is a key consideration in the prediction approach to finite
population sampling. Valliant, Dorfman, and Royall (2000) synthesize much of the model-based
literature. In that approach, a working model is formulated that is used to construct a point
estimator of a mean or total. Variance estimators are created that are robust in the sense of
being approximately model-unbiased and consistent for the model-variance even when the variance
specification in the working model is incorrect. In this paper, that approach is extended to
the general regression estimator (GREG) to construct variance estimators that are approximately
model-unbiased but are also approximately design-unbiased in single-stage sampling. A number of
alternatives are compared including the jackknife and some variants of the jackknife. We will
use a particular class of linear models along with Bernoulli or Poisson sampling as motivation
for the variance estimators. However, some of these estimators can often be successfully
applied in practice to single-stage designs where selections are not independent.


Associated with each unit in the population is a target 
variable \(Y_i\) and a p-vector of auxiliary variables 
\(x_i = (x_{i1}, \ldots, x_{ip})'\) where \(i = 1, \ldots, N\). The population vector 
of totals of the auxiliaries is \(T_x = (T_{x1}, \ldots, T_{xp})'\) where 
\(T_{xk} = \sum_{i=1}^{N} x_{ik}\), \(k = 1, \ldots, p\). The general regression estimator, 
defined below, is motivated by a linear model in which the 
Y’s are independent random variables with 

\[ E_M(Y_i) = x_i'\beta, \qquad \mathrm{var}_M(Y_i) = v_i. \qquad (1.1) \]


In most situations (1.1) is a “working” model that is likely 
to be incorrect to some degree. 

Assume that a probability sample s is selected and that 
the selection probability of sample unit i is \(P(\delta_i = 1) = \pi_i\), 
where \(\delta_i\) is a 0-1 indicator for whether a unit is in the 
sample or not. We assume that the sample selection 
mechanism is ignorable. Roughly speaking, ignorability 
means that the joint distribution of the Y’s and the sample 
indicators, given the x’s, can be factored into the product of 
the distribution for Y given x and the distribution for the 
indicators given x (see Sugden and Smith 1984 for a formal 
definition). In that case, model-based inference can proceed 
using the model and ignoring the selection mechanism. 

The n-vector of targets for the sample units is 
\(Y_s = (Y_1, \ldots, Y_n)'\), and the n x p matrix of auxiliaries for 
the sample units is \(X_s = (x_1, \ldots, x_n)'\). Define the diagonal 
matrix of selection probabilities as \(\Pi_s = \mathrm{diag}(\pi_i)\), \(i \in s\), and 
the diagonal matrix of model-variances as \(V_s = \mathrm{diag}(v_i)\). 
The GREG estimator of the total, \(T = \sum_{i=1}^{N} Y_i\), is then 
defined as the Horvitz-Thompson estimator or π-estimator, 
\(\hat T_\pi = \sum_{i \in s} Y_i/\pi_i\), plus an adjustment: 

\[ \hat T_G = \hat T_\pi + \hat B'(T_x - \hat T_x) \qquad (1.2) \]

where \(\hat B = A_{\pi s}^{-1} X_s' V_s^{-1} \Pi_s^{-1} Y_s\) with \(A_{\pi s} = X_s' V_s^{-1} \Pi_s^{-1} X_s\) 
and \(\hat T_x = \sum_{i \in s} x_i/\pi_i\). The GREG estimator can also be 
written as 

\[ \hat T_G = g_s' \Pi_s^{-1} Y_s \qquad (1.3) \]

with \(g_s = V_s^{-1} X_s A_{\pi s}^{-1}(T_x - \hat T_x) + 1_s\) and \(1_s\) being an 
n-vector of 1’s. Expression (1.3) will be useful for 
subsequent calculations. 
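
The equivalence of the two forms of the GREG can be checked numerically. The following is a minimal Python sketch, not the author's code: the function name `greg_total`, the simulated population, and the equal-probability sample are all illustrative assumptions. It computes the total once as the π-estimator plus a regression adjustment and once as a weighted sum with g-weights, and returns the g-weights so that their calibration property can be inspected.

```python
import numpy as np

def greg_total(y, X, pi, v, Tx):
    """Return the GREG total in adjustment form, in g-weight form, and the g-weights."""
    w = 1.0 / (v * pi)                       # diagonal of V_s^{-1} Pi_s^{-1}
    A = X.T @ (w[:, None] * X)               # A_{pi s} = X_s' V_s^{-1} Pi_s^{-1} X_s
    B_hat = np.linalg.solve(A, X.T @ (w * y))
    Tx_hat = X.T @ (1.0 / pi)                # pi-estimate of the auxiliary totals
    adj = np.linalg.solve(A, Tx - Tx_hat)    # A^{-1} (T_x - T_x_hat)
    total_adj = np.sum(y / pi) + B_hat @ (Tx - Tx_hat)   # form (1.2)
    g = 1.0 + (X / v[:, None]) @ adj         # g-weights
    total_g = np.sum(g * y / pi)             # form (1.3)
    return total_adj, total_g, g

# Hypothetical population: an intercept plus one quantitative auxiliary.
rng = np.random.default_rng(0)
N, n = 200, 30
x_pop = rng.uniform(1.0, 10.0, N)
X_pop = np.column_stack([np.ones(N), x_pop])
y_pop = 2.0 + 3.0 * x_pop + rng.normal(0.0, 1.0, N)
s = rng.choice(N, size=n, replace=False)     # equal-probability sample of fixed size
pi = np.full(n, n / N)
Tx = X_pop.sum(axis=0)
t_adj, t_g, g = greg_total(y_pop[s], X_pop[s], pi, np.ones(n), Tx)
```

In this sketch the weights g_i/π_i reproduce the known auxiliary totals exactly, which is the calibration property that makes the g-weight form convenient in later calculations.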

A variant of the GREG, referred to as a “cosmetic” 
estimator, was introduced by Sarndal and Wright (1984) 
and amplified by Brewer (1995, 1999). A cosmetic esti- 
mator also has design-based and model-based interpre- 
tations. The variance estimators in this paper could also be 
adapted to cover cosmetic estimation. 

Assuming that N is known, the GREG estimator of the 
mean is simply \(\hat{\bar Y}_G = \hat T_G / N\). We will concentrate on the 
analysis of \(\hat{\bar Y}_G\). (In some situations, particularly ones where 
multi-stage sampling is used, the population size is 
unknown and an estimate, \(\hat N\), must be used in the denominator 
of \(\hat{\bar Y}_G\). The following analysis for the mean does 
not apply in that case.) Either quantitative or qualitative 
auxiliaries (or both) can be used in the GREG. If a qualitative 
variable like gender (male or female) is used, then 
two or more columns in \(X_s\) will be linearly dependent, in 
which case a generalized inverse, denoted by \(A_{\pi s}^{-}\), will be 
used in (1.2) and (1.3). Note that, although \(A_{\pi s}^{-}\) is not 
unique, the GREG estimator \(\hat{\bar Y}_G\) is invariant to the choice 
of generalized inverse. The proof is similar to Theorem 
7.4.1 in Valliant et al. (2000). 

The GREG estimator is model-unbiased under (1.1) and 
is approximately design-unbiased in large probability 
samples. Note that the model-unbiasedness requires only 
that \(E_M(Y_i) = x_i'\beta\); if the variance parameters in (1.1) are 
misspecified, the GREG will still be model-unbiased. On 
the other hand, if \(E_M(Y_i)\) is incorrectly specified, the 
GREG is model-biased and the model mean squared error 
may contain an important bias-squared term. The estimation 
error of the GREG \(\hat{\bar Y}_G\) is defined as 

\[ \hat{\bar Y}_G - \bar Y = N^{-1}(a_s' Y_s - 1_r' Y_r) \]

where \(\bar Y = T/N\), \(a_s = \Pi_s^{-1} g_s - 1_s\), \(Y_r\) is the (N - n)-vector of 
target variables for the nonsample units, and \(1_r\) is a vector 
of N - n 1’s. Next, suppose that the true model for \(Y_i\) is 

\[ E_M(Y_i) = x_i'\beta, \qquad \mathrm{var}_M(Y_i) = \psi_i, \qquad (1.4) \]

i.e., the variance specification is different from (1.1) but 
\(E_M(Y_i)\) is the same. Using the estimation error, the 
error-variance of \(\hat{\bar Y}_G\) is then 

\[ \mathrm{var}_M(\hat{\bar Y}_G - \bar Y) = N^{-2}(a_s' \Psi_s a_s + 1_r' \Psi_r 1_r) \]

where the n x n covariance matrix for \(Y_s\) is \(\Psi_s = \mathrm{diag}(\psi_i)\) 
and \(\Psi_r\) is the (N - n) x (N - n) covariance matrix for \(Y_r\). 
When the sample and population sizes are both large and 
the sampling fraction, f = n/N, is negligible, the error-variance 
is approximately 

\[ \mathrm{var}_M(\hat{\bar Y}_G - \bar Y) \approx N^{-2} a_s' \Psi_s a_s = N^{-2} \sum_{i \in s} a_i^2 \psi_i. \qquad (1.5) \]

Note that this variance depends on the true variance parameters, 
\(\psi_i\), and on the working model variance parameters, 
\(v_i\), because \(v_i\) is part of \(a_i\). Since \(a_i\) is approximately the 
same as \(g_i/\pi_i\) when selection probabilities are small, the 
error variance in that case is also approximately 

\[ \mathrm{var}_M(\hat{\bar Y}_G - \bar Y) \approx N^{-2} \sum_{i \in s} \frac{g_i^2}{\pi_i^2} \psi_i. \qquad (1.6) \]


For model-based variance estimation, we will take either of 
the asymptotic forms in (1.5) or (1.6) as the target. However, 
when the sampling fraction is substantial, the term 
\(1_r' \Psi_r 1_r / N^2\) can be an important part of the error-variance 
and (1.5) or (1.6) may be poor approximations. 

We will consider the design variance under two single-stage 
plans, Bernoulli and Poisson. In Poisson sampling, the 
indicators \(\delta_i\) for whether a unit is in the sample or not are 
independent with \(P(\delta_i = 1) = 1 - P(\delta_i = 0) = \pi_i\) (see Sarndal, 
Swensson, and Wretman 1992, section 3.5 for a more 
detailed description). Bernoulli sampling is a special case 
of Poisson sampling in which each unit has the same 
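inclusion probability. Under these two plans, the approximate 
design-variance of \(\hat{\bar Y}_G\) is 

\[ \mathrm{var}_\pi(\hat{\bar Y}_G) \approx N^{-2} \sum_{i=1}^{N} \frac{1-\pi_i}{\pi_i} E_i^2 \qquad (1.7) \]

where \(E_i = Y_i - x_i' B\) and \(B = (X'V^{-1}X)^{-1} X'V^{-1}Y\) is the 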
regression parameter estimator evaluated for the full finite 
population. Sarndal (1996) recommends using the GREG 
in conjunction with sampling plans for which (1.7) is valid 
on the grounds that the variance (1.7) is simple and that the 
use of regression estimation can often more than compen- 
sate for the random sample sizes that are a consequence of 
such designs. 
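
As an illustration of how the approximate design-variance in (1.7) behaves under the two plans, the sketch below evaluates it for a simulated population. It is only a sketch under stated assumptions: the function `approx_design_variance`, the population, and the choice of Poisson probabilities proportional to x are hypothetical and not taken from the paper.

```python
import numpy as np

def approx_design_variance(y, X, v, pi):
    """Approximate design-variance of the GREG mean, summing over the whole population."""
    N = len(y)
    XtVinv = (X / v[:, None]).T                   # X' V^{-1} for the full population
    B = np.linalg.solve(XtVinv @ X, XtVinv @ y)   # census regression coefficient B
    E = y - X @ B                                 # census residuals E_i
    return np.sum((1.0 - pi) / pi * E**2) / N**2

rng = np.random.default_rng(4)
N, n = 300, 60
x = rng.uniform(1.0, 10.0, N)
X = np.column_stack([np.ones(N), x])
y = 5.0 + 2.0 * x + rng.normal(0.0, 2.0, N)
v = np.ones(N)                                    # assumed constant working variance

pi_bern = np.full(N, n / N)                       # Bernoulli: equal probabilities
pi_pois = n * x / x.sum()                         # Poisson: probabilities proportional to x
var_bern = approx_design_variance(y, X, v, pi_bern)
var_pois = approx_design_variance(y, X, v, pi_pois)

# One Poisson sample: independent 0-1 indicators, so the realized size is random.
delta = rng.random(N) < pi_pois
```

With equal probabilities the expression reduces to (1 - f)/(nN) times the sum of squared census residuals, which is one way to sanity-check an implementation.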

The Bernoulli and Poisson designs and the linear models 
(1.1) and (1.4) serve mainly as motivation for the variance 
estimators presented in sections 2 and 3. As noted by Yung 
and Rao (1996, page 24), it is common practice to use 
variance estimators that are appropriate to a design with 
independent selections or to a with-replacement design 
even when a sample has been selected without replacement. 
Likewise, variance estimators motivated by a linear model 
are often applied in cases where departures from the model 
are anticipated. This practical approach underlies the 
thinking in this paper and is illustrated in the simulation 
study reported in section 4. 


2. VARIANCE ESTIMATORS 


Our general goal in variance estimation will be to find 
estimators that are consistent and approximately unbiased 
under both a model and a design. Kott (1990) also 
considered this problem. Note that the goal here is not the 
estimation of a combined (or anticipated) model-design 
variance, 


\[ E_M E_\pi \left[ (\hat{\bar Y}_G - \bar Y)^2 \right]. \]


Rather we seek estimators that are useful for both 
\(\mathrm{var}_M(\hat{\bar Y}_G - \bar Y)\) and \(\mathrm{var}_\pi(\hat{\bar Y}_G)\). The arguments given here are 
largely heuristic ones used to motivate the forms of the 
variance estimators. Additional, formal conditions such as 
those found in Royall and Cumberland (1978) or Yung and 
Rao (2000) are needed for model-based and design-based 
consistency and approximate unbiasedness. 

First, consider estimation of the approximate model- 
variance given in (1.5). In the following development, we 
assume that, as N and n become large, 

(i) \(N \max_i(\pi_i) = O(n)\) and 

(ii) \(A_{\pi s}/N\) converges to a matrix of constants, \(A_\pi\). 


A residual associated with sample unit i is \(r_i = Y_i - \hat Y_i\) 
where \(\hat Y_i = x_i' \hat B\). The vector of predicted values for the 
sample units can be written as 

\[ \hat Y_s = H_s Y_s \qquad (2.1) \]

where \(H_s = X_s A_{\pi s}^{-1} X_s' V_s^{-1} \Pi_s^{-1}\). The predicted value for an 
individual unit is \(\hat Y_i = \sum_{j \in s} h_{ij} Y_j\) where \(h_{ij} = x_i' A_{\pi s}^{-1} x_j / (v_j \pi_j)\) 
is the (ij)th element of \(H_s\). The matrix \(H_s\) is the 
analog to the usual hat matrix (Belsley, Kuh and Welsch 
1980) from standard regression analysis. The diagonal 
elements of the hat matrix are known as leverages and are 
a measure of the effect that a unit has on its own predicted 
value. Notice that the inverses of the selection probabilities 
are involved in (2.1), although these would have no role in 
purely model-based analysis. 

The following lemma, which is a variation of some 
results in Lemma 5.3.1 of Valliant et al. (2000), gives some 
properties of the leverages and the hat matrix. 


Lemma 1. Assume that (i) and (ii) hold. For 
\(H_s = X_s A_{\pi s}^{-1} X_s' V_s^{-1} \Pi_s^{-1}\) the following properties hold for 
all \(i \in s\): 

(a) \(h_{ii} = O(n^{-1})\). 
(b) \(H_s\) is idempotent. 
(c) \(0 \le h_{ii} \le 1\). 

Proof: Since \(h_{ii} = x_i' A_{\pi s}^{-1} x_i / (v_i \pi_i)\), conditions (i) and (ii) 
imply that \(h_{ii} = O(n^{-1})\). Part (b) follows from direct 
multiplication, using the definition of \(H_s\). To prove (c) 
note that \(h_{ii} \ge 0\) since it is a quadratic form. Part (b) implies 
that \(h_{ii} = h_{ii}^2 + \sum_{j \ne i} h_{ij} h_{ji}\), which can hold only if \(h_{ii} \le 1\). 
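
The properties in Lemma 1 are easy to confirm numerically for a given sample. The sketch below is illustrative only: the design matrix, working variances, and selection probabilities are simulated assumptions. It builds the weighted hat matrix and extracts the leverages; because the matrix is idempotent, its trace equals the number of columns of the design matrix, which gives one more check.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
X = np.column_stack([np.ones(n), rng.uniform(1.0, 10.0, n)])
v = X[:, 1].copy()                     # assumed working variances, proportional to x
pi = rng.uniform(0.05, 0.3, n)         # assumed unequal selection probabilities

w = 1.0 / (v * pi)                     # diagonal of V_s^{-1} Pi_s^{-1}
A = X.T @ (w[:, None] * X)             # A_{pi s}
H = X @ np.linalg.solve(A, X.T * w)    # H_s = X_s A^{-1} X_s' V_s^{-1} Pi_s^{-1}
h = np.diag(H)                         # leverages h_ii
```

Note that H is not symmetric when the weights vary across units, so the idempotency check should use the matrix product rather than any symmetric shortcut.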

Next, we write the residual as \(r_i = Y_i(1 - h_{ii}) - \sum_{j \in s(i)} h_{ij} Y_j\) 
where s(i) is the sample excluding unit i. Since 
\(E_M(r_i) = 0\), we have \(E_M(r_i^2) = \mathrm{var}_M(r_i)\) and 

\[ E_M(r_i^2) = \psi_i (1 - h_{ii})^2 + \sum_{j \in s(i)} h_{ij}^2 \psi_j \qquad (2.2) \]

under model (1.4). Using Lemma 1(a), we have \(h_{ii} = o(1)\), 
\(h_{ij} = o(1)\), and consequently, \(E_M(r_i^2) \approx \psi_i\). Thus, in large 
samples, \(r_i^2\) is an approximately unbiased estimator of the 
correct model-variance even though the variance specification 
in model (1.1) was incorrect. As a result, \(r_i^2\) is a 
robust estimator of the model-variance for unit i regardless 
of the form of \(\psi_i\). A simple, robust estimator of the 
approximate model-variance (1.5) is then 

\[ v_{R1}(\hat{\bar Y}_G) = N^{-2} \sum_{i \in s} a_i^2 r_i^2 \qquad (2.3) \]

which is a type of “sandwich” estimator (see, e.g., White 
1982). (Note that a formal argument that \(v_{R1}\) is robust 
would require conditions such that \(n E_M(v_{R1})\) and 
\(n N^{-2} \sum_{i \in s} a_i^2 \psi_i\) converge to the same quantity.) Another 
variance estimator, similar to \(v_{R1}\) if \(a_i \approx g_i/\pi_i\), is 

\[ v_{R2}(\hat{\bar Y}_G) = N^{-2} \sum_{i \in s} \frac{g_i^2 r_i^2}{\pi_i^2}. \qquad (2.4) \]

An estimator of the approximate design-variance in (1.7) 
is 

\[ v_\pi(\hat{\bar Y}_G) = N^{-2} \sum_{i \in s} \frac{1-\pi_i}{\pi_i^2} r_i^2. \qquad (2.5) \]

An alternative suggested by Sarndal et al. (1989) as having 
better conditional properties is 

\[ v_{SSW}(\hat{\bar Y}_G) = N^{-2} \sum_{i \in s} \frac{1-\pi_i}{\pi_i^2} g_i^2 r_i^2. \qquad (2.6) \]

Another, similar estimator, used in the SUPERCARP 
software (Hidiroglou, Fuller and Hickman 1980) and 
derived using Taylor series methods, is 

\[ v_T(\hat{\bar Y}_G) = N^{-2} \frac{n}{n-1} \sum_{i \in s} \left( \frac{g_i r_i}{\pi_i} - \frac{1}{n} \sum_{j \in s} \frac{g_j r_j}{\pi_j} \right)^2. \qquad (2.7) \]

As shown in the Appendix, the second term in parentheses 
in (2.7) converges in probability to zero under model (1.1). 
Thus, \(v_T \approx v_{R2}\) in large samples. 

When the selection probability of each unit is small, 
\(v_{SSW}\) will be similar to \(v_{R1}\), \(v_{R2}\), and \(v_T\). All three will be 
approximately model-unbiased under (1.4) and approximately 
design-unbiased under Bernoulli and Poisson 
sampling. On the other hand, \(v_\pi\) is approximately design-unbiased 
but ignores the \(g_i\) coefficients and is biased under 
either model (1.1) or (1.4). 
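
The estimators of this section differ only in how the squared residuals are weighted, which a short sketch makes concrete. The code below is an illustration under stated assumptions, not the software used in the paper: the function name, the dictionary keys, and the simulated data are hypothetical. The keys describe each estimator in words: squared residuals weighted by a_i (`sandwich_a`), by g_i/π_i (`sandwich_g`), by the design factor alone (`design`), by the design factor and g_i (`ssw`), and the Taylor-series form centered at the sample mean (`taylor`).

```python
import numpy as np

def section2_variances(y, X, pi, v, Tx, N):
    """Variance estimators for the GREG mean built from unadjusted squared residuals."""
    n = len(y)
    w = 1.0 / (v * pi)
    A = X.T @ (w[:, None] * X)
    B_hat = np.linalg.solve(A, X.T @ (w * y))
    r = y - X @ B_hat                                   # residuals r_i
    g = 1.0 + (X / v[:, None]) @ np.linalg.solve(A, Tx - X.T @ (1.0 / pi))
    a = g / pi - 1.0                                    # a_i = g_i / pi_i - 1
    z = g * r / pi
    return {
        "sandwich_a": np.sum(a**2 * r**2) / N**2,
        "sandwich_g": np.sum(z**2) / N**2,
        "design": np.sum((1.0 - pi) / pi**2 * r**2) / N**2,
        "ssw": np.sum((1.0 - pi) * z**2) / N**2,
        "taylor": n / (n - 1) * np.sum((z - z.mean())**2) / N**2,
    }

rng = np.random.default_rng(5)
N, n = 400, 40
x_pop = rng.uniform(1.0, 10.0, N)
X_pop = np.column_stack([np.ones(N), x_pop])
y_pop = 4.0 + 1.5 * x_pop + rng.normal(0.0, 1.0, N)
s = rng.choice(N, size=n, replace=False)
out = section2_variances(y_pop[s], X_pop[s], np.full(n, n / N),
                         np.ones(n), X_pop.sum(axis=0), N)
```

In this particular setup (intercept in the model, constant working variance, equal probabilities) the weighted residuals sum to zero, so the Taylor form equals n/(n - 1) times the g-weighted sandwich form exactly.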

As a simple example, consider Bernoulli sampling with 
\(\pi_i = n/N\) and the working model \(E_M(Y_i) = x_i \beta\), 
\(\mathrm{var}_M(Y_i) = \sigma^2 x_i\). Then the GREG is the ratio estimator 
\(\hat{\bar Y}_G = \bar Y_s \bar x / \bar x_s\) where \(\bar x\) is a finite population mean. The 
approximate model variance under the more general specification, 
\(\mathrm{var}_M(Y_i) = \psi_i\), is \((\bar\psi_s/n)(\bar x/\bar x_s)^2\) where \(\bar\psi_s = \sum_{i \in s} \psi_i / n\). 
The approximate design-variance is \((1 - f)/(nN) \sum_{i=1}^{N} (Y_i - x_i \bar Y/\bar x)^2\) 
where \(\bar Y\) is a finite population mean. 
The estimator \(v_{R2} = (\bar x/\bar x_s)^2 n^{-2} \sum_{i \in s} r_i^2\), with \(r_i = Y_i - x_i \bar Y_s/\bar x_s\), is 
approximately unbiased for the model-variance and, 
because \(\bar x/\bar x_s\) converges to 1 in large Bernoulli samples, \(v_{R2}\) is also 
approximately unbiased for the design-variance as long as 
f is small. In contrast, \(v_\pi = n^{-2}(1 - f) \sum_{i \in s} (Y_i - x_i \bar Y_s/\bar x_s)^2\) is 
approximately design-unbiased but is model-unbiased only 
in balanced samples where \(\bar x_s = \bar x\). Royall and Cumberland 
(1981) noted similar results for the ratio estimator in simple 
random sampling without replacement. 
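
The ratio-estimator example can be verified directly. In the sketch below, which uses a simulated population and a fixed-size equal-probability sample as a stand-in for a Bernoulli sample of realized size n (both are assumptions made for illustration), the GREG mean under a working variance proportional to x coincides with the ratio estimator, and every g-weight equals the ratio of the population mean of x to the sample mean of x.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 500, 50
x_pop = rng.gamma(4.0, 5.0, N)
y_pop = 1.5 * x_pop + rng.normal(0.0, np.sqrt(x_pop))

s = rng.choice(N, size=n, replace=False)   # stand-in for a Bernoulli sample of size n
x, y = x_pop[s], y_pop[s]
pi = np.full(n, n / N)
v = x                                      # working variance proportional to x

A = np.sum(x**2 / (v * pi))                # scalar A_{pi s}
B_hat = np.sum(x * y / (v * pi)) / A       # slope through the origin
Tx, Tx_hat = x_pop.sum(), np.sum(x / pi)

greg_mean = (np.sum(y / pi) + B_hat * (Tx - Tx_hat)) / N
ratio_mean = y.mean() * x_pop.mean() / x.mean()
g = 1.0 + (Tx - Tx_hat) * x / (v * A)      # g-weights, constant across units here
```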


3. ALTERNATIVE VARIANCE ESTIMATORS 
USING ADJUSTED SQUARED RESIDUALS 


The first alternative variance estimator we consider is the 
jackknife. The particular version to be studied is defined as 


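\[ v_J(\hat{\bar Y}_G) = \frac{n-1}{n} \sum_{i \in s} \left( \hat{\bar Y}_{G(i)} - \hat{\bar Y}_{G(\cdot)} \right)^2 \qquad (3.1) \]

where \(\hat{\bar Y}_{G(i)}\) has the same form as the full sample estimator 
after omitting sample unit i. If the selection probability has 
the form \(\pi_i = n p_i\), then (3.1) can be rewritten. Using the 
convention that the subscript (i) means that sample unit i 
has been omitted, we have 

\[ \hat{\bar Y}_{G(i)} = \hat T_{G(i)}/N, \qquad \hat{\bar Y}_{G(\cdot)} = \sum_{i \in s} \hat{\bar Y}_{G(i)}/n, \]
\[ \hat T_{G(i)} = \hat T_{\pi(i)} + \hat B_{(i)}'(T_x - \hat T_{x(i)}), \]
\[ \hat T_{\pi(i)} = \sum_{j \in s(i)} n Y_j / [\pi_j (n-1)], \qquad \hat T_{x(i)} = \sum_{j \in s(i)} n x_j / [\pi_j (n-1)], \]

and 

\[ \hat B_{(i)} = A_{\pi s(i)}^{-1} X_{s(i)}' V_{s(i)}^{-1} \Pi_{s(i)}^{-1} Y_{s(i)}, \qquad A_{\pi s(i)} = X_{s(i)}' V_{s(i)}^{-1} \Pi_{s(i)}^{-1} X_{s(i)}. \]

Another more conservative, but asymptotically equivalent, 
version of the jackknife replaces \(\hat{\bar Y}_{G(\cdot)}\) with the full sample 
estimator \(\hat{\bar Y}_G\). Design-based properties of the jackknife in 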
(3.1) are usually studied in samples selected with 
replacement (see, e.g., Krewski and Rao 1981, Rao and Wu 
1985, Yung and Rao 1996), but applied in practice to 
without-replacement designs. Note that for the linear 
estimator \(\hat{\bar Y}_\pi = N^{-1} \sum_{i \in s} Y_i/\pi_i\), in probability proportional to 
size without-replacement sampling, neither the jackknife, 
\(v_J\), nor the approximations to \(v_J\) given later in this section, 
reduce to the usual Horvitz-Thompson or Yates-Grundy 
variance estimators. 
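
A direct, brute-force version of the delete-one jackknife is sketched below. It is a minimal illustration and not the author's implementation: the function names are hypothetical, the multiplier (n - 1)/n and the centering at the mean of the replicates are assumptions about the exact form of (3.1), and the delete-one probabilities are taken to be (n - 1) p_i, which is equivalent to reweighting the retained units by n/(n - 1).

```python
import numpy as np

def greg_mean(y, X, pi, v, Tx, N):
    """GREG estimator of the mean for one (sub)sample."""
    w = 1.0 / (v * pi)
    A = X.T @ (w[:, None] * X)
    B_hat = np.linalg.solve(A, X.T @ (w * y))
    return (np.sum(y / pi) + B_hat @ (Tx - X.T @ (1.0 / pi))) / N

def jackknife_variance(y, X, pi, v, Tx, N):
    """Delete-one jackknife: recompute the GREG mean with each unit omitted in turn."""
    n = len(y)
    reps = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        pi_del = pi[keep] * (n - 1) / n        # pi_j = n p_j becomes (n - 1) p_j
        reps[i] = greg_mean(y[keep], X[keep], pi_del, v[keep], Tx, N)
    return (n - 1) / n * np.sum((reps - reps.mean())**2), reps

# Hypothetical check case: intercept-only model with equal probabilities,
# where the GREG mean is the sample mean.
rng = np.random.default_rng(6)
n, N = 12, 120
y = rng.normal(10.0, 2.0, n)
vJ, reps = jackknife_variance(y, np.ones((n, 1)), np.full(n, n / N),
                              np.ones(n), np.array([float(N)]), N)
```

For the intercept-only case the jackknife reproduces the familiar s²/n for a sample mean, which is a convenient correctness check before applying the function to a model with real auxiliaries.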

With some effort we can write the jackknife in a form 
that involves the residuals and the leverages. The rewritten 
form will make clear the relationship of the jackknife to the 
variance estimators in section 2. First, note the following 
equalities that are easily verified: 


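\[ \hat T_{\pi(i)} = \frac{n}{n-1}\left( \hat T_\pi - \frac{Y_i}{\pi_i} \right), \qquad \hat T_{x(i)} = \frac{n}{n-1}\left( \hat T_x - \frac{x_i}{\pi_i} \right), \qquad (3.2) \]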


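\[ A_{\pi s(i)} = A_{\pi s} - \frac{x_i x_i'}{v_i \pi_i}, \qquad X_{s(i)}' V_{s(i)}^{-1} \Pi_{s(i)}^{-1} Y_{s(i)} = X_s' V_s^{-1} \Pi_s^{-1} Y_s - \frac{x_i Y_i}{v_i \pi_i}. \qquad (3.3) \]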


Using a standard formula for the inverse of the sum of two 
matrices, the slope estimator, omitting sample unit i, equals 


\[ \hat B_{(i)} = \hat B - \frac{A_{\pi s}^{-1} x_i r_i}{(1 - h_{ii}) v_i \pi_i}. \]


Details of this and the succeeding computations are 
sketched in the Appendix. After a considerable amount of 
algebra, we have 


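\[ \hat T_{G(i)} - \hat T_G = -\frac{n}{n-1}(D_i - F_i) \]

where 

\[ D_i = \frac{g_i r_i}{\pi_i (1 - h_{ii})} \]

and \(F_i\) is defined in the Appendix. The jackknife in (3.1) is 
then equal to 

\[ v_J(\hat{\bar Y}_G) = N^{-2} \frac{n}{n-1} \left[ \sum_{i \in s} (D_i - \bar D)^2 + \sum_{i \in s} (F_i - \bar F)^2 - 2 \sum_{i \in s} (D_i - \bar D)(F_i - \bar F) \right]. \qquad (3.4) \]

Expression (3.4) is an exact equality and could be used as 
a computational formula for the jackknife. This would 
sidestep the need to mechanically delete a unit, compute 
\(\hat{\bar Y}_{G(i)}\) and so on, through the entire sample. 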
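
The delete-one slope formula given earlier in this section is the weighted least squares form of the usual leave-one-out update: the full-sample slope is corrected by a term proportional to the omitted unit's residual divided by one minus its leverage. The sketch below checks that identity numerically, taking the regression weights to be 1/(v_i π_i); the data and the index of the omitted unit are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 25
X = np.column_stack([np.ones(n), rng.uniform(1.0, 10.0, n)])
y = 1.0 + 2.0 * X[:, 1] + rng.normal(0.0, 1.0, n)
v = X[:, 1].copy()                       # assumed working variances
pi = rng.uniform(0.05, 0.3, n)           # assumed selection probabilities
w = 1.0 / (v * pi)

A = X.T @ (w[:, None] * X)
B_hat = np.linalg.solve(A, X.T @ (w * y))
r = y - X @ B_hat
A_inv_Xt = np.linalg.solve(A, X.T)                    # A^{-1} X', shape (p, n)
h = w * np.einsum("ij,ji->i", X, A_inv_Xt)            # leverages h_ii

i = 7                                                 # unit to omit (arbitrary)
keep = np.arange(n) != i
Xk, yk, wk = X[keep], y[keep], w[keep]
B_direct = np.linalg.solve(Xk.T @ (wk[:, None] * Xk), Xk.T @ (wk * yk))
B_update = B_hat - A_inv_Xt[:, i] * w[i] * r[i] / (1.0 - h[i])
```

Because the update needs only full-sample quantities, all n delete-one slopes can be obtained from a single fit, which is what makes a closed-form jackknife practical.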

In large samples the first term in brackets in (3.4) is 
dominant while the second and third are near zero under 
some reasonable conditions. Thus, in large samples the 
jackknife is approximated by \(v_J(\hat{\bar Y}_G) \approx N^{-2} \frac{n}{n-1} \sum_{i \in s} (D_i - \bar D)^2\), 
or, equivalently, 

\[ v_J(\hat{\bar Y}_G) \approx N^{-2} \frac{n}{n-1} \sum_{i \in s} \left[ \frac{g_i r_i}{\pi_i (1 - h_{ii})} - \frac{1}{n} \sum_{j \in s} \frac{g_j r_j}{\pi_j (1 - h_{jj})} \right]^2. \qquad (3.5) \]

As shown in the Appendix, the second term in (3.5) 
converges in probability to zero under model (1.1). Consequently, 
a further approximation to the jackknife is 

\[ v_J(\hat{\bar Y}_G) \approx N^{-2} \sum_{i \in s} \frac{g_i^2 r_i^2}{\pi_i^2 (1 - h_{ii})^2}. \qquad (3.6) \]

As (3.5) and (3.6) show, the jackknife implicitly incorporates 
the \(g_i\) coefficients needed for estimating the 
model-variance. The right-hand side of (3.6) is itself an 
alternative estimator that we will denote by \(v_{J^*}(\hat{\bar Y}_G)\). 
Yung and Rao (1996) also derived an approximation to 
the jackknife for the GREG in multistage sampling. For 
single-stage sampling, their approximation is equal to \(v_T\), 
defined in (2.7), which is the same as (3.5) if the leverages 
are zero. Duchesne (2000) also presented a formula for the 
jackknife, which he denoted as \(V_{JK2}\), that involved sample 
leverages. The advantage of (3.4) is that it makes clear 
which parts of the jackknife are negligible in large samples. 
Duchesne also presented an estimator, denoted by \(V_{JK1}\), 
that is essentially the same as \(v_T\) and is an approximation 
to the jackknife. 

Expressions (3.5) and (3.6) explicitly show how the 
leverages affect the size of the jackknife. Weighted 
leverages, \(h_{ii}\), that are not near zero will inflate \(v_J\). 
Depending on the configuration of the x’s, this could be a 
substantial effect on some samples. 

Since \(h_{ii}\) approaches zero with increasing sample size, 
\(v_J\), \(v_{R2}\), \(v_{SSW}\), and \(v_T\) have the same asymptotic properties. 
In particular, the jackknife is approximately unbiased with 
respect to either the model or the design and is robust to 
misspecification of the variances in model (1.1). However, 
the factor \((1 - h_{ii})\) in (3.6) is less than or equal to 1 and 
will make the jackknife larger than the other variance 
estimators. This will typically result in confidence intervals 
based on the jackknife covering at a higher rate than ones 
using \(v_{R2}\), \(v_{SSW}\), or \(v_T\). 

Note, also, that if a without-replacement sample is used, 
and some first-order or second-order selection probabilities 
are not small, the choices \(v_{R2}\), \(v_T\), \(v_J\), and \(v_{J^*}\) will be 
over-estimates of either the design-variance or the 
model-variance. To account for non-negligible selection 
probabilities, we can make some simple adjustments. An 
adjusted version of \(v_{J^*}(\hat{\bar Y}_G)\), patterned after \(v_{SSW}\), is 

\[ v_{JP}(\hat{\bar Y}_G) = N^{-2} \sum_{i \in s} \frac{(1 - \pi_i) g_i^2 r_i^2}{\pi_i^2 (1 - h_{ii})^2}. \qquad (3.7) \]

This expression is similar to \(V_{JK3}\) of Duchesne (2000), 
although \(V_{JK3}\) omits the leverages. Expression (3.6) also 
suggests another alternative that is closely related to an 
estimator of the error variance of the best linear unbiased 
predictor of the mean under model (1.1) (see Valliant et al. 
2000, chapter 5). This estimator is somewhat less 
conservative than (3.6), but still adjusts using the leverages: 

\[ v_D(\hat{\bar Y}_G) = N^{-2} \sum_{i \in s} \frac{g_i^2 r_i^2}{\pi_i^2 (1 - h_{ii})}. \qquad (3.8) \]

Because \(h_{ii} = o(1)\), \(v_D\) is also approximately model and 
design-unbiased. A variant of this that may perform better 
when some selection probabilities are large is 

\[ v_{DP}(\hat{\bar Y}_G) = N^{-2} \sum_{i \in s} \frac{(1 - \pi_i) g_i^2 r_i^2}{\pi_i^2 (1 - h_{ii})}. \qquad (3.9) \]

4. SIMULATION RESULTS 


To check the performance of the variance estimators, we 
conducted several simulation studies using three different 
populations. The first is the Hospitals population listed in 
Valliant et al. (2000, Appendix B). The second population 
is the Labor Force population described in Valliant (1993). 
The third is a modification of the Labor Force population. 
In all three populations, sampling is done without 
replacement, as described below. These sampling plans will 
test the notion that variance estimators motivated, in part, 
by with-replacement designs can still be useful when 
applied to without-replacement designs. 

The Hospitals population has N = 393 and a single 
auxiliary value x, which is the number of inpatient beds in 
each hospital. The Y variable is the number of patients discharged 
during a particular time period. The GREG estimator 
for this population is based on the model 
\(E_M(Y) = \beta_1 x^{1/2} + \beta_2 x\), \(\mathrm{var}_M(Y) = \sigma^2 x\). Samples of size 50 
and 100 were selected using simple random sampling 
without replacement (srswor) and probability proportional 
to size (pps) without replacement with the size being the 
square root of x. For each combination of selection method 
and sample size, 3,000 samples were selected. The estimators 
\(v_\pi\), \(v_{R1}\), \(v_{R2}\), \(v_{SSW}\), \(v_D\), \(v_{DP}\), \(v_J\), \(v_{J^*}\), and \(v_{JP}\) 
were calculated for each sample. For comparison we also 
included the π-estimator, \(\hat{\bar Y}_\pi = \hat T_\pi / N\). The variance estimator 
\(v_T\) was included but is not reported here since results were 
little different from \(v_{R2}\). 

The Labor Force population contains 10,841 persons. 
The auxiliary variables used were age, sex, and number of 
hours worked per week. The Y variable was total weekly 
wages. Age was grouped into four categories: 19 years and 
under, 20-24, 25-34, and 35 or more. The model for the 
GREG included an intercept, main effects for age and sex, 
and the quantitative variable, hours worked. A constant 
model-variance was used. Samples of size 50, 100, and 250 
were selected. The two selection methods used were srswor 
and sampling without replacement with probability propor- 
tional to hours worked. (This population has some cluster- 
ing but this was ignored in these simulations.) 

The third population was a version of Labor Force 
designed to inject some outliers or skewness into the 
weekly wages variable. We denote this new version as 
“LF(mod)” for reference. In the original Labor Force 
population, weekly wages were top-coded at $999. For each 
such top-coded wage, a new wage was generated equal to 
$1,000 plus a lognormal random variable whose 
distribution had scale and shape parameters of 6.9 and 1. 
Recoded wages were generated for 4.4% of the population. 
Prior to recoding, the annualized mean wage was $19,359, 
and the maximum was $51,948; after recoding, the mean 
was $23,103 and the maximum was $608,116. Thus, 
LF(mod) exhibits more of the skewness in income that 
would be found in a real population. 
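
The recoding step used to build LF(mod) can be sketched as follows. This is a hypothetical reconstruction, not the author's code: it assumes that the stated scale and shape parameters of 6.9 and 1 correspond to the mean and standard deviation of the underlying normal distribution on the log scale, and the function name and arguments are illustrative.

```python
import numpy as np

def recode_topcoded(wages, rng, top=999.0, meanlog=6.9, sdlog=1.0):
    """Replace each top-coded weekly wage by $1,000 plus a lognormal draw."""
    wages = np.asarray(wages, dtype=float).copy()   # leave the input untouched
    topcoded = wages >= top
    draws = rng.lognormal(mean=meanlog, sigma=sdlog, size=int(topcoded.sum()))
    wages[topcoded] = 1000.0 + draws
    return wages
```

Applied to a wage vector, only the top-coded entries change, so the bulk of the distribution is preserved while the upper tail becomes long and skewed.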

The resulting LF(mod) distribution is shown in Figure 1 
where weekly wages is plotted against hours worked for 
subgroups defined by age. In each panel the black points 
are for males while the open circles are for females. A 
horizontal reference line is drawn in each panel at $999. 
Although there is a considerable amount of over-plotting, 
the general features are clear. Wage levels and spread go up 
as age increases, hours worked per week is related, though 
somewhat weakly, to wages, and wages are most skewed 
for age groups 25-34 and 35+. Less evident is the fact that 
wages for males are generally higher than ones for females. 

Table 1 shows the empirical percentage relative biases, 
defined as the average over the samples of \((\hat T - T)/T\) for 
the π-estimator and general regression estimator for the 
various populations and sample sizes. Root mean square 
errors (rmse’s), defined as the square root of the average 
over the samples of \((\hat T - T)^2\), are also shown. In the 
Hospitals population, both estimators have negligible bias 
at either sample size. The GREG is considerably more 
efficient in Hospitals than the π-estimator because of a 
strong relationship of Y to x. In the two Labor Force populations, 
both the π-estimator and the GREG are nearly 
unbiased while the GREG is somewhat more efficient as 
measured by the rmse for all sample sizes and selection 
methods. 

Figure 1. Scatterplots of Weekly Wages versus Hours Worked per Week in Four Age Groups for the LF(mod) population. Open circles 
are for females. Black circles are for males. A horizontal line is drawn at $999 per week, the maximum value in the original 
Labor Force population. 

Table 1 
Relative biases and root mean square errors (rmse’s) of the 
π-estimator and the general regression estimator in different simulation 
studies of 3,000 samples each. 
Hospitals Labor Force LF(mod) 
n=50 n=100 n=50 n=100 n=250 n=50 n=100 n=250 
Simple random samples 


Y 
rene (20-29-01 0:6 0 0 -0.1 0 -0.3 
rmse US SOG! BE PEIN NGS" RO) ONL BRR! 
¥, 
Relbias(%) 0.2 0.2 O41 OE OL O4E Wye 0) 
rmse Soy Aligly  Aevsie wile eg lah ~ KOs es slo 
Probability proportional to size samples 
‘ 
Relbias(%) -0.1 0.1 -0.5 0 0 0 -0.1 -0.1 
rmse 37.6 244 28.2 203 12.6 806 546 34.1 
¥, 
Relbias(%) 0.1 0.1 -0.10 0.10 O -0.6 -0.7 -04 


rmse PUY PD ANS) PR SYS IPO SSSI Sioa yeh 


Table 2 lists the empirical relative biases (relbiases) of 
the nine variance estimators, defined as \(100(\bar v - \mathrm{mse})/\mathrm{mse}\), 
where \(\bar v\) is the average of a variance estimator over the 
3,000 samples and mse is the empirical mean square error 
of the GREG. The rows of the table are sorted by the size of 
the relbias in LF(mod) for srswor’s of size 50, although the 
ordering would be similar for the other populations, sample 
sizes, and selection methods. In the Hospitals population, 
the sampling fraction is substantial, especially when n=100. 
As might be expected, this results in the estimators that 
omit any type of finite population correction (fpc), namely \(v_{R2}\), 
\(v_D\), \(v_J\), and \(v_{J^*}\), being severe over-estimates in either 
srswor or pps samples. Because \(v_{R1}\) lacks a term to reflect 
the model-variance of the nonsample sum, it under-estimates 
the mse badly when the sampling fraction is large. 

In the Labor Force and LF(mod) populations, increasing 
sample size leads to decreasing bias. The estimators \(v_\pi\), 
\(v_{R1}\), \(v_{R2}\), and \(v_{SSW}\) have negative biases that tend to be less 
severe as the sample size increases. The jackknife \(v_J\) and its 
variants, \(v_{J^*}\) and \(v_{JP}\), are over-estimates, especially at n=50. 
The estimators \(v_D\) and \(v_{DP}\) are more nearly unbiased at 
each of the sample sizes than most of the other estimators. 

The empirical coverages of 95% confidence intervals 
across the 3,000 samples in each set are shown in Table 3 
for the Hospitals population. The three choices of variance 
estimator that use the leverage adjustments but not 
fpc's, \(v_D\), \(v_J\), and \(v_{J^*}\), are larger and, thus, have higher 
coverage rates than \(v_{R1}\) and \(v_{R2}\). The tendency of the 
jackknife to be larger than other variance estimates for the 
GREG has also been noted by Stukel, Hidiroglou, and 
Sarndal (1996). This is an advantage for the smaller sample 
size, n=50. When n=100 and the sampling fraction is 
large, the estimators with the fpc’s, \(v_\pi\), \(v_{SSW}\), \(v_{DP}\), and 
\(v_{JP}\), have closer to the nominal 95% coverage rates while 
\(v_{R2}\), \(v_D\), \(v_J\), and \(v_{J^*}\) cover in about 97 or 98% of the 
samples. The estimator \(v_{JP}\), which approximates the 
jackknife but includes an fpc, is a good choice at either 
sample size and under either sampling plan. 
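
The two summary measures used in the tables, the percentage relative bias of a variance estimator and the L, M, U breakdown of confidence interval coverage, can be computed from the simulation output as in the sketch below. The function names and the tiny illustrative inputs are assumptions; the definitions follow the text and the table captions.

```python
import numpy as np

def relbias_percent(v_estimates, point_estimates, truth):
    """100 (vbar - mse) / mse, with mse the empirical mean square error of the estimator."""
    mse = np.mean((np.asarray(point_estimates) - truth) ** 2)
    return 100.0 * (np.mean(v_estimates) - mse) / mse

def coverage_lmu(point_estimates, v_estimates, truth, z=1.96):
    """Percent of samples with t < -z (L), |t| <= z (M), and t > z (U)."""
    t = (np.asarray(point_estimates) - truth) / np.sqrt(v_estimates)
    return (100.0 * np.mean(t < -z),
            100.0 * np.mean(np.abs(t) <= z),
            100.0 * np.mean(t > z))

# Tiny made-up illustration: four "samples" with true mean 10 and variance estimate 1.
est = np.array([9.0, 10.0, 11.0, 14.0])
var_est = np.ones(4)
L, M, U = coverage_lmu(est, var_est, truth=10.0)
rb = relbias_percent(var_est, est, truth=10.0)
```

Reporting L and U separately, rather than M alone, is what reveals one-sided failures such as intervals that miss the population mean mostly on the low side.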


Table 2 
Relative biases of nine variance estimators for the general 
regression estimator in different simulation studies of 3,000 
samples each. 
Hospitals Labor Force LF(mod) 
n=50 n=100 n=50 n=100 n=250 n=50 n=100 n=250 
Simple random samples 

Table 3 
95% confidence interval coverage rates for simulations using the 
Hospitals population and nine variance estimators. 3,000 simple 
random samples and probability proportional to size samples were selected 
without replacement for samples of size 50 and 100. L is percent 
of samples with \((\hat{\bar Y}_G - \bar Y)/v^{1/2} < -1.96\); M is percent with 
\(|\hat{\bar Y}_G - \bar Y|/v^{1/2} \le 1.96\); U is percent with \((\hat{\bar Y}_G - \bar Y)/v^{1/2} > 1.96\). 

n=50 n=100 
L M U L M U 
Simple random samples 

Tables 4 and 5 show the coverage rates for the Labor 
Force and LF(mod) populations. For the former, \(v_D\), \(v_{DP}\), 
\(v_J\), \(v_{JP}\), and \(v_{J^*}\) are clearly better in Labor Force at n=50 
for both srswor and pps samples. But, for n=250, coverage 
rates are similar for all estimators. The purely design-based 
estimator, \(v_\pi\), is unsatisfactory at the smaller sample sizes 
for either sampling plan. As in Hospitals, \(v_{JP}\) gives near 
nominal coverage at each sample size in the Labor Force 
population. 

The most striking results in Tables 4 and 5 are for 
LF(mod) where all variance estimators give poor coverage. 
Coverages range from 78.0% for the combination (ve 

n=50, srswor) to 90.7% for (v, and v VSO) pps). 
Virtually all cases of non-coverage are because 
\((\hat{\bar Y}_G - \bar Y)/v^{1/2} < -1.96\), where v is any of the variance estimators. The 
poor coverage rates occur even though the π-estimator and 
GREG are unbiased over all samples (see Table 1) and, in 
the cases of \(v_J\), \(v_{JP}\), and \(v_{J^*}\), the variance estimators are 
overestimates (see Table 2). 


Table 4 
95% confidence interval coverage rates for simulations using the 
Labor Force and LF(mod) populations and nine variance 
estimators. 3,000 simple random samples were selected without 
replacement for samples of size 50, 100 and 250. L is percent of 
samples with \((\hat{\bar Y}_G - \bar Y)/v^{1/2} < -1.96\); M is percent with 
\(|\hat{\bar Y}_G - \bar Y|/v^{1/2} \le 1.96\); U is percent with \((\hat{\bar Y}_G - \bar Y)/v^{1/2} > 1.96\). 
n=50 n=100 n=250 
L M U L M U L M U 

Negative estimation errors, \(\hat{\bar Y}_G - \bar Y\), occur in samples 
that include relatively few persons with large weekly 
wages. Figure 2 is a plot of t-statistics based on \(v_{JP}\), i.e., 
\((\hat{\bar Y}_G - \bar Y)/v_{JP}^{1/2}\), versus the number of sample persons with 
weekly wages of $1,000 or more in sets of 1,000 samples 
for (srswor; n=50, 100, 250). The negative estimation 
errors in samples with few persons with high incomes lead 
to negative t-statistics, and confidence intervals that miss 
the population mean on the low side. The problem 
decreases with increasing sample size, but the convergence 
to the nominal coverage rates is slow and occurs “from the 
bottom up.” Regardless of the variance estimator used, 
coverage will be less than 95% unless the sample is quite 
large. 


Table 5 
95% confidence interval coverage rates for simulations using the 
Labor Force and LF(mod) populations and nine variance 
estimators. 3,000 probability proportional to size samples were 


selected without replacement for samples of size 50, 100 and 250. 
Lis percent oLamps with (Y, - Y)/v'?<-1.96; Mis percent with 


IY, -¥|/v'?< 1.96; U is percent with (Y, - ¥)/v'?>1.96. 
n=50 n=100 n=250 
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We also examined how well the variance estimators 
perform, conditional on sample characteristics. We present 
only results related to bias of the variance estimators to 
conserve space. For the Hospitals population, we sorted the 
samples based on Darl Tn T ,), which is the sum of 
the differences of the 2-estimates of the totals of x 7 and x 
from their population totals. Twenty groups of 150 samples 
each were then formed. In each group, we computed the 
bias of ip. along with the rmse, and the square root of the 
average of each variance estimator. The results are plotted 
in Figure 3 for srswor with n=50 and 100 and for pps with n=50 
and 100. A subset of the variance estimators is plotted. The 
horizontal axis in each panel gives values of D... Since 
vay viv. p> and v,, are similar through most of the range 
of D,, only the jackknife v, is plotted. Also, Vpp and Ving 
are close, and only the latter j is plotted. The GREG does 
have a conditional bias that affects the rmse in off-balance 
samples. The poor conditional properties of v, are most 
evident in the simple random samples where the bias of v_ 
as an estimate of the mse runs from negative to positive 
over the range of D.. Among the other variance estimates, 
conditional biases are similar to the unconditional biases in 
Table 2. Both v,, and Vecyw are in theory approximately 
design and model-unbiased, and both track the rmse well. 
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Figure 2. Plot of t-statistics versus the number of sample persons with weekly wages greater than $1,000 in the sets of 1,000 simple 
random samples of size n =50, 100, 250 from the LF(mod) population. Horizontal reference lines are drawn at +1.96. 
Points are jittered to minimize overplotting. 
Figure 3. Plot of conditional biases, rmse’s, and means of standard error estimates of the GREG for the samples from the Hospitals
population. Horizontal and vertical reference lines are drawn at 0. The lowest curve in each panel is the bias of the GREG. The
thick solid line is the conditional root mean square error.
Figure 4. Plot of conditional biases, rmse’s, and means of standard error estimates of the GREG for the samples from the Labor Force
population. Horizontal and vertical reference lines are drawn at 0. The lowest curve in each panel is the bias of the GREG.
The thick solid line is the conditional root mean square error.


Figure 4 is a similar plot for the samples from the Labor 
Force population. The following sets of estimates are very 
similar and only the first in each set is included in the plots: 
(VE) np planand (Vea Vyrs Vip)s Only the srswor and pps 
samples of size n = 50 and 250 are included. The horizontal
axis is again D_π, which is the sum of differences between
the π-estimates and the population values of the totals for
age and sex groups and the number of hours worked per
week. The conditional bias of v_π is evident in samples with
the smallest values of D_π, but the problem diminishes for
the larger sample size in both srswor and pps samples. The
jackknife v_J is, on average, the largest of the variance
estimators throughout the range of D_π. The differences
among the variance estimates and their biases are smaller for
the larger sample size. The estimators v,, Vesw> and v, all 
track the rmse reasonably well except when D_π is most
negative, where all are somewhat low.


5. CONCLUSION 


A variety of estimators of the variance of the general 
regression estimator have been proposed in the sampling 
literature, mainly with the goal of estimating the design-based
variance. Estimators can be easily constructed that
are approximately unbiased for both the design-variance
and, under certain models, the model-variance. Moreover, 
the dual-purpose estimators studied here are robust esti- 
mators of a model-variance even if the model that motivates 
the GREG has an incorrect variance parameter. 

A key feature of the best of these estimators is the 
adjustment of squared residuals by factors analogous to the 
leverages used in standard regression analysis. The desira- 
bility of using leverage corrections to regression variance 
estimators in order to combat heteroscedasticity is well- 
known in econometrics, having been proposed by 
MacKinnon and White (1985) and recently revisited by 
Long and Ervin (2000). One of the best choices is an
approximation to the jackknife, denoted here by v_JP, that
includes a type of finite population correction.
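
To make the idea of a leverage correction concrete, the following sketch computes leverage-adjusted sandwich variance estimates for an ordinary, unweighted least squares fit. It is not one of the GREG variance estimators studied in this paper. The function name, the simulated data, and the labels HC0, HC2, and HC3 (the names commonly used in the econometrics literature for these adjustments) are assumptions introduced only for illustration.

```python
import numpy as np

def leverage_adjusted_variances(X, y):
    """Return OLS coefficients, leverages, and three sandwich estimates
    of the coefficient covariance matrix for an unweighted regression."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    # Leverages h_ii = x_i' (X'X)^{-1} x_i, the diagonal of the hat matrix.
    h = np.einsum("ij,jk,ik->i", X, xtx_inv, X)

    def sandwich(weights):
        meat = X.T @ (X * weights[:, None])
        return xtx_inv @ meat @ xtx_inv

    hc0 = sandwich(resid ** 2)                 # unadjusted squared residuals
    hc2 = sandwich(resid ** 2 / (1 - h))       # divided by (1 - h_ii)
    hc3 = sandwich(resid ** 2 / (1 - h) ** 2)  # divided by (1 - h_ii)^2, jackknife-like
    return beta, h, hc0, hc2, hc3

# Simulated heteroscedastic data (assumed values only).
rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 40)
X = np.column_stack([np.ones(40), x])
y = 2 + 3 * x + rng.normal(scale=x, size=40)
beta, h, hc0, hc2, hc3 = leverage_adjusted_variances(X, y)
```

Because each leverage lies between 0 and 1, the adjusted squared residuals are never smaller than the unadjusted ones, which is why adjustments of this kind tend to counteract the downward bias of the unadjusted estimator in small samples.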

The robust estimators studied here are quite useful for 
variables whose distributions are reasonably “well 
behaved.” They adjust variance estimators in small and 
moderate size samples in a way that often results in better 
confidence interval coverage. However, they are no defense 
when variables are extremely skewed, and large obser- 
vations are not well represented in a sample. Whether one 
refers to this problem as one of skewness or of outliers, the 
effect is clear. A sample that does not include a sufficient
number of units with large values will produce an estimated
mean that is too small. A variance estimator that is small 
often accompanies the small estimated mean. As the 
simulations in section 4 illustrate, in such samples even the 
best of the proposed variance estimators will not yield 
confidence intervals that cover at the nominal rate. The 
transformation methods of Chen and Chen (1996) might 
hold some promise, but that approach would have to be 
tested for the more complex GREG estimators studied here. 

The most effective solution to the skewness problem 
does not appear to be to make better use of the sample data. 
Rather, the sample itself needs to be designed to include 
good representation of the large units. In many cases, 
however, like a survey of households to measure income or 
capital assets, this may be difficult or impossible if auxiliary 
information closely related to the target variable is not 
available. Better use of the sample data employing models 
for skewed variables may then be useful (see, e.g., Karlberg 
2000). 
APPENDIX: Details of Jackknife Calculations


Using (3.2), (3.3), and the standard matrix result in 
Lemma 5.4.1 of Valliant et al. (2000), we have 


i; A’xx/ Ao Vel, 
PRR Le SPB ed 
From this and the definition of B̂, the slope estimator,
omitting unit i, is B̂_(i) = B̂ − Q_i, where


AAsX; r; 


The GREG estimator, after deleting unit i, is
Y. es 
A n A i A n A i 
T..... = ——| T_-— |+\B-Q.]’ |T.-——| T.-— |] |. 
ow) 2| . | | Q) cs ms 4 


After some rearrangement, this can be rewritten as 
and


= (B- ay |= -t|} 


I, follows. that. Taga gy saute 1) D,— D2) + 


n(n-1)'F, where F, =(G,-G,)+n7' (K,- K,) with 


G, and K, being sample means with the obvious 
definitions. Substituting in the jackknife formula (3.1) 
gives 

v, (Yo) ig cats 


p> (D; > D,y ss pz F; ax yD F; (D; a D,). eK) 


Formula (A.1) is exact, but with some further
approximations we can get the relative sizes of the terms.
Using the values of G_i and K_i above and the fact that h_ii
and the elements of Q_i are o(1), we have


ni h,Y,-Y, ieee NX eos 
Gn Ke eB Oy 
m (1- hj) on ti 

a ce 

peter, Bia! By Ty, 
Tait 

mene ze 
n 


where ~ denotes “asymptotically equivalent to.” It follows
thatiul, =Omanditthaten,(¥_) = e(De- Di aes n(3:5) 
holds. 

Next, we can show that the second term in (3.5) 
converges in probability to zero. The vector of residuals can 
be expressed as r. = as H eS a the second term in 
(3.5) is equal to N n g’ Il, 6. trees Tg. where 
U =diag(1 -h,,),1€s. Tas, the second, term in (3. 5) i is the 
square of BNC In? g'TT'U'r, which has 
expectation zero under any model with pe ay "Ora he 
model-variance of B is 


Ne var (oy evr) = 


N?n‘g/1;'U! (1-H,)x 


(A.2) 
V.(i-H,)'U'N,'g, 


which has order of magnitude n ~* under the assumptions 
we have made. Consequently, the second term in (3.5) is the 
square of a term with mean zero and a model-variance that 
approaches zero as the sample size increases. The second 
term in (3.5) then converges to zero by Chebyshev’s 
inequality. This justifies (3.6). 
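
The final step can be written out explicitly. Write Z_n for the mean-zero term whose square is the second term in (3.5); the symbol Z_n is introduced here only for illustration. Since E_M(Z_n) = 0, Chebyshev's inequality gives, for any fixed ε > 0,

```latex
\Pr\left( |Z_n| \ge \varepsilon \right)
  \le \frac{\operatorname{var}_M(Z_n)}{\varepsilon^{2}}
  \longrightarrow 0 \quad \text{as } n \to \infty ,
```

so Z_n converges in probability to zero. The same then holds for Z_n^2, because the event Z_n^2 ≥ ε is the event |Z_n| ≥ ε^(1/2).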
In This Issue


This issue of Survey Methodology includes papers on a variety of topics including overviews of 
small area statistics and data quality in statistical offices, survey nonresponse and imputation, survey 
design, data collection and estimation. 

In the first paper of this issue, Brackstone identifies strategies and approaches for the 
development of small area statistics programs in national statistical offices. The topic of small area 
estimation will be covered by a number of papers in a special section in the June 2003 issue of 
Survey Methodology. The paper first considers the crucial role of censuses, and discusses issues 
related to their usefulness for small area statistics. Other potential sources of small area statistics 
include administrative files and sample surveys, either on their own or combined with census data 
to provide estimates for the intercensal period or for characteristics not directly covered by the 
census. Rolling censuses are also discussed, as well as the unique challenges in producing small 
area business and environmental statistics. Finally, issues of organization of national statistical 
offices for production and dissemination of small area statistics are considered. 

Trewin reviews the practices and approaches used to maintain high quality of output from a 
national statistical office. Important ingredients include good relations with respondents, skilled and 
motivated staff, sound statistical and operational methods, and relevance of statistical programs. 
Current challenges include increasing the use of administrative data sources, effective use of the 
internet for both collection and dissemination, maintaining knowledge and skills as staff leave, and 
handling increasing user expectations. This paper is based on a talk presented as the keynote address 
at Statistics Canada’s Symposium 2001. 

Thibaudeau presents an innovative approach to the imputation of demographic characteristics in 
a large scale survey or Census. Instead of relying on the usual approach of either the closest 
complete record in the processing stream or constructing imputation groups, Thibaudeau proposes 
a compromise method which uses maximum likelihood estimation based on the conditional 
probabilities. This approach seeks to create groups that are close in order and in geography to the 
imputed record. He also presents an interesting Bayesian approach to evaluating the method. 

Nandram, Han and Choi consider the problem of analyzing multinomial nonignorable non- 
response data from small areas in the framework of Bayesian inference. This paper extends some 
earlier work by Stasny by assuming a Dirichlet prior underlying the multinomial probabilities and 
using a prior distribution on the hyperparameters. The authors apply this model to Body Mass Index 
data from a complex survey design. 

In the Stewart paper, the possible biases introduced by different contact strategies in telephone 
time-use surveys are investigated. Two contact strategies, convenient-day scheduling, where the 
designated reference day changes with the contact day, and designated-day scheduling, where the 
reference day remains fixed, are discussed and compared through simulation studies. 

Bell and McCaffrey consider the problem of unbiasedly estimating the variance of coefficients 
of linear regressions from multi-stage survey data when only a small number of Primary Sampling 
Units (PSUs) are sampled. After investigating situations where the bias of the linearization variance 
estimator can be large, a bias reduced linearization variance estimator is proposed. In addition, a 
Satterthwaite approximation is used to determine the degrees of freedom to be used for tests and 
confidence intervals in conjunction with the bias reduced linearization variance estimator. 

Sirken considers estimation of the volume of transactions that a population of establishments has 
with a population of households. An approach based on indirect sampling of establishments through 
the households that they have transactions with is compared to the more typical approach based on 
direct pps sampling of establishments. Estimators and expressions for the variances are derived and 
compared for the two methods. Situations where one approach or the other is preferable are 
explored. 

Rivest considers the problem of identifying stratum boundaries. The commonly used Lavallée- 
Hidiroglou algorithm assumes that the values of the study variable are available and are used in the 
determination of optimal stratum bounds. In his paper, Rivest relaxes this assumption and modifies 
the Lavallée-Hidiroglou algorithm to account for a discrepancy between the stratification variable 
and the study variable through the use of models that link these two variables together. These models 
are then incorporated into the Lavallée-Hidiroglou algorithm. 

In the Lu and Sitter paper, the problem of the sample size being smaller or only slightly larger 
than the total number of strata is considered. Consequently, conventional methods of sample 
allocation to strata may not be applicable. One solution for this problem is to use a linear 
programming technique to minimize the expected lack of desirability of the samples subject to a 
constraint of expected proportional allocation (EPA). However, as the number of strata increases 
this solution rapidly becomes expensive in terms of magnitude of computation. In the proposed 
approach, the amount of computation is reduced substantially at the small cost of approximate EPA 
for strict EPA. 

Renssen and Martinus explore the use of generalized inverse matrices in survey sampling. After 
reviewing the properties of generalized inverses, they consider the generalized regression estimator 
when the set of regressors is not of full rank, and they set out a regularity condition under which the 
estimator is invariant to the choice of generalized inverse. They then present an algorithm for 
calculating the regression weights, and briefly discuss weighting in the Dutch Labour Force Survey. 


M.P. Singh 
Strategies and Approaches for Small Area Statistics


GORDON J. BRACKSTONE¹


ABSTRACT 


National statistical offices are often called upon to produce statistics for small geographic areas, in addition to their primary 
responsibility for measuring the condition of the country as a whole and its major subdivisions. This task presents challenges 
that are different from those faced in statistical programs aiming primarily at national or provincial statistics. This paper 
examines these challenges and identifies strategies and approaches for the development of programs of small area statistics. 
The important foundation of a census of population, as well as the primary role of a consistent geographic infrastructure, 
are emphasized. Potential sources and methods for the production of small area data in the social, economic and 
environmental fields are examined. Some organizational and dissemination issues are also discussed. 


KEY WORDS: Small area statistics; Census; Geography. 


1. INTRODUCTION 


The mandate of most national statistical offices (NSO) 
focuses on the monitoring of social, economic, and environ- 
mental conditions at the national level, and for the major 
administrative units (provinces, states, major metropolitan 
areas) within the country. However, the demand for data at 
lower geographic levels is always present, especially from 
local governments and from businesses needing to make 
investment, marketing, and location decisions that depend 
on knowledge of local areas. We will use the term “small 
area statistics” to mean statistics for areas below the level of
state, province, or major metropolitan areas — a broad 
spectrum of areas from large towns, through urban neigh- 
bourhoods, to rural villages. In some circles the term “small 
areas” is used more broadly to refer to any small sub-group 
or domain of the population, but here we are talking strictly 
about small geographic areas. 

The extent of an NSO’s responsibility for small area 
statistics depends on the division of governmental responsi- 
bilities within a country. For example, in some countries 
local governments are the creation of provinces and the 
responsibility for supporting their statistical needs may rest 
with provincial governments. But in many countries, 
whatever the formal division of powers, it is, de facto, the 
NSO that is expected to respond to the need for small area 
statistics, either within its own resources or in cooperation 
with other levels of government. At the very least, it is the 
NSO that must set the standards and framework for small 
area data if these are not to become a mishmash of uneven 
and overlapping statistics incomparable across the country. 

With limited budgets an NSO is faced with the difficult 
trade-off between investment in national statistics and 
provision of small area detail. How should it choose 
between covering more subject areas, or existing subject 
areas in more detail, at the national and provincial levels, 
and, on the other hand, providing more small area detail for 
subject areas it is already covering nationally? There is no
formula for resolving this problem. The balance struck in 
any country will be largely a function of national needs, 
relative powers, and historical tradition, with perhaps some 
statistical considerations on the margin. Nevertheless, there 
is a series of measures and approaches that an NSO can
consider to maximize the degree to which it can satisfy 
demands for small area statistics within a limited budget. 

Four potential sources of small area statistical data, either
individually or in combination, account for most production 
of small area data by statistical agencies. Censuses or 
complete enumerations of populations are the traditional 
source. Administrative records, including national registers, 
that cover all, or almost all, of a defined population are in 
many respects equivalent to a census. National sample 
surveys are rarely large enough to produce small area data 
directly but they do represent a valuable current source of 
information that can be used, under certain assumptions and 
in combination with other sources, to produce small area 
data. And finally, local studies focused on particular small 
areas will produce small area data, but not for complete sets 
of small areas. Sources such as satellite imaging or aerial 
photography can be thought of as censuses or local studies 
depending on their coverage. 

In this paper we first review the important role of the 
Census of Population, with or without a population register, 
in the provision of small area socio-economic data (Section 
2), and then emphasise the fundamental role of an up-to- 
date geographic infrastructure to support any production of 
small area statistics, including especially the census of 
population (Section 3). We then examine approaches to 
providing small area data on individuals and families 
between censuses (Section 4), on business activities 
(Section 5), and on environmental issues (Section 6). We 
conclude with some general observations about the 
dissemination of small area statistics and the management 
of small area statistics within an NSO. 


¹ Gordon J. Brackstone, Informatics and Methodology Field, Statistics Canada, Ottawa, Ontario, K1A 0T6. E-mail: bracgor@statcan.ca.
2. CENSUS OF POPULATION


The census of population, in most countries, plays the 
central role in the provision of small area data about people, 
families and households. Based on a complete enumeration 
of the population (at least for basic characteristics), its esti- 
mates are free of the sampling error that limits the ability of 
sample surveys to produce small area estimates. Provided 
the individual households are geographically coded to a fine 
level (e.g., a block or block face), direct tabulations of 
households can produce statistical aggregates for any geo- 
graphic area that can be defined, or approximated, in terms 
of the lowest level of geographic coding. 

However, censuses have their drawbacks. They are 
costly, and therefore they are infrequent. Data from the last 
census may provide a poor representation of a small area 
that is undergoing rapid development. In many countries, 
sampling is utilized in the census for many of the questions. 
While this introduces sampling error into estimates from the 
census, these samples are still huge compared with those in 
a typical sample survey. Furthermore, the samples are 
typically spread through every enumeration area of the 
country, so the ability to produce small area estimates is 
maintained, even though the small areas will need to be 
somewhat larger than in a true census. 

Potentially more serious, with respect to accuracy, are 
nonsampling errors such as coverage error and response 
bias. Most censuses miss some people, or count some 
people twice, and it has been repeatedly shown that those 
miscounted are generally not typical of the population as a 
whole. Census estimates may therefore be biased against 
certain sub-groups of the population. If these subgroups 
(e.g., certain immigrant groups) tend to be geographically 
clustered, this can have a serious impact on estimates for 
some small areas. Response bias arises if a census question 
is systematically misunderstood by many respondents. Both 
small area and large area estimates would be affected by 
such errors. 

Countries that maintain a population register have the 
potential to produce census-like data for small areas more 
frequently than the traditional 5-10 year cycles of a census. 
Up-to-date residence registration is clearly a requirement 
for accurate small area data from such registers. The 
breadth of data available from a register system may be less 
than that available through a conventional census, since the 
former is limited to the characteristics maintained in link- 
able administrative registers. In some countries the popula- 
tion register may be used as the basis for a census that 
collects the necessary additional characteristics not
available within existing registers. Redfern (1989) provides a
useful description of practices within Europe in this regard. 

Since the Census has the potential to produce estimates 
for very small areas, rules to protect against direct or 
residual disclosure of individual data have to be in place. 
These can include imposing a minimum population on areas 
for which data will be released, random perturbation of
data, suppression of data, or other techniques (Jabine 1993;
Zayatz, Steel and Rowland 2000). NSOs have also
to be concerned about privacy issues arising with the publi- 
cation of small area census data that, while not disclosing 
any individual responses, do reveal dominant characteristics 
of an area (e.g., that 90% of the families received 
unemployment benefits). Such findings cannot be withheld, 
but they can be selected and presented with sensitivity. 
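
As a simple illustration of the first of these rules, the sketch below withholds the statistic for any area whose population falls below a minimum size. The function name, the area codes, and the threshold of 100 are hypothetical, and the sketch does not address residual disclosure, which arises when a suppressed value can be recovered by subtraction from published totals.

```python
def apply_minimum_population_rule(areas, min_population=100):
    """Return the statistics that may be released.

    `areas` maps an area code to a (population, statistic) pair.
    The default threshold of 100 is an assumed value for illustration.
    """
    released = {}
    for code, (population, statistic) in areas.items():
        if population >= min_population:
            released[code] = statistic
        else:
            released[code] = None  # suppressed: area is below the minimum size
    return released

# Hypothetical tabulation: area code -> (population, proportion with some characteristic).
areas = {"A01": (2500, 0.12), "A02": (40, 0.90), "A03": (100, 0.05)}
released = apply_minimum_population_rule(areas)
```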

Though a census, with or without a population register, 
is a source of direct small area data as of census day, the 
value of such data declines as time passes. However, the 
role of census data in the provision of small area statistics 
goes well beyond the direct use of the results from each 
periodic enumeration. Inter-censally, census data may be 
used as a benchmark, a sampling frame, or as auxiliary 
information to be used with other sources of data that are 
available between censuses. These usages are pursued in
Section 4. An innovative alternative to the traditional census
is described in Section 4.4. 


3. GEOGRAPHIC INFRASTRUCTURE 


To enable a national census to produce accurate data for 
small areas, a geographic infrastructure of boundaries and 
mapping capacity covering the whole country is a 
prerequisite. Such an infrastructure requires that each 
dwelling be associated with a precise geographic location 
on the ground, where the degree of precision determines the 
fineness with which small areas can be defined. Though 
modern global positioning technology makes it possible to 
pinpoint each dwelling to a specific pair of coordinates, it 
is usually sufficient for statistical purposes to associate each 
dwelling in an urban area with a block face (i.e., one side of 
a street between two intersections), or a building in the case 
of high-rise buildings. In rural areas, the chosen degree of 
precision will depend on local administrative and natural 
boundaries, though maximum flexibility is preserved by 
using precise coordinates for each dwelling. 

While necessary for a census, a geographic infrastructure 
is equally required for the provision of small area statistics 
from other sources. Essentially each data point, from 
whatever source, has to be associated with a geographic 
location at a level detailed enough to allow aggregation into 
any small areas of statistical interest. For example, if the 
data source is an administrative register, or a business 
register, the address in each record must be convertible into 
a pair of geographic coordinates, or at least into a small area 
within which the address falls. Since administrative 
registers often use mailing addresses, a file that converts 
postal codes into geographic locations is a valuable tool in 
the development of small area data. 
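
The role of such a conversion file can be sketched as a simple lookup followed by aggregation. The postal codes, area names, and function name below are hypothetical; a real conversion file would also have to handle postal codes that straddle area boundaries, which this sketch ignores.

```python
from collections import defaultdict

def tabulate_by_small_area(records, postal_to_area):
    """Aggregate record values to small areas through a postal code lookup.

    `records` is an iterable of (postal_code, value) pairs.
    Returns the area totals and the list of postal codes that could not be placed.
    """
    totals = defaultdict(float)
    unmatched = []
    for postal_code, value in records:
        key = postal_code.replace(" ", "").upper()  # normalize spacing and case
        area = postal_to_area.get(key)
        if area is None:
            unmatched.append(postal_code)
        else:
            totals[area] += value
    return dict(totals), unmatched

# Hypothetical conversion file and administrative records.
postal_to_area = {"A1A1A1": "area_1", "B2B2B2": "area_1", "C3C3C3": "area_2"}
records = [("a1a 1a1", 10.0), ("B2B 2B2", 5.0), ("C3C 3C3", 7.0), ("Z9Z 9Z9", 1.0)]
totals, unmatched = tabulate_by_small_area(records, postal_to_area)
```

Keeping track of the unmatched records matters in practice, since the share of addresses that cannot be geographically coded is itself a measure of the quality of the resulting small area data.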

The availability of an accurate up-to-date geographic 
infrastructure, whether maintained by the NSO or obtained 
from outside, is essential if a program of small area 
statistics is to have flexibility in the choice of areas for
which statistics are produced. 
4. SMALL AREA STATISTICS ON PERSONS
AND HOUSEHOLDS - BETWEEN CENSUSES 


We turn now to the issue of producing small area data 
for persons or households inter-censally. Clearly the exis- 
tence of a current population register makes a fundamental 
difference to what is possible, and how it can be done. We 
will confine ourselves to the case where no regularly 
updated population register exists. 

In such circumstances, there are three main classes of 
approach. The first is to utilize census-like files that come 
from administrative systems and purport to cover the whole 
of a well-defined population. The second is to exploit 
sample survey data and, through additional model assump- 
tions, produce estimates for smaller (though still not very 
small) areas than is possible through direct survey esti- 
mation. The third category is the combination of one or 
both of these first two approaches with the use of data from 
the most recent census. In the following paragraphs we 
review some of the characteristics of these approaches. 


4.1 Administrative Files 


An example of an administrative file with small area 
statistical potential is the annual file of individual income 
tax returns. Other examples, with narrower population 
definitions, might be drivers’ licences, employment insur- 
ance recipients, or health insurance records. In the case of 
tax data, if each record contains a residential address that 
can be associated with a geographic point or small area, 
then data can be tabulated directly for small areas, with due 
regard for confidentiality (as with census data). The charac- 
teristics available would generally be restricted to demo- 
graphic and income variables, and the coverage would be 
limited to taxfilers. Nevertheless, such a file represents a 
rich source of annual data for quite small areas. Population 
coverage can be improved through the imputation of 
dependents “claimed” on the tax record. In Canada, the 
coverage of such imputed files is approaching that of the 
census as coverage increases among low income earners 
who need to file tax returns to obtain social assistance 
benefits. 

With administrative data in general, the statistician has 
to take what is available (though some influence on content 
may be possible in the longer term), reconcile any differ- 
ences in concepts, definition or coverage between the admi- 
nistrative file and the statistical objectives, and assess any 
issues of reporting or coding accuracy in the records. 
Subject to these precautions, administrative data can 
provide a geographically rich potential source of small area 
data (Brackstone 1987). 


4.2 Sample Survey Data 


The problem with sample survey data as a source of 
small area statistics is sample size. There are frequently 
insufficient sample cases in the small area to allow a 
reliable direct estimate to be produced, and sometimes none
at all. In large national sample surveys it may be possible to
devise sampling strategies that ensure an acceptable level of 
precision for planned small areas, such as sub-provincial 
regions, without significantly degrading the reliability of 
estimates at higher levels (Singh, Gambino and Mantel 
1994). But for smaller areas, or for areas of similar size not 
taken into account during design, reliable estimation will 
not be possible. Larger samples help, and may allow direct 
estimation for some of the larger small areas, but budgets 
usually constrain this approach as a general solution. If no 
other data sources are available, statisticians can only resort 
to model-based methods which involve making assump- 
tions about how data for a small area relate to other data. 
These methods are often described as “borrowing strength” 
since they borrow information from elsewhere in the sample 
survey to augment the number of units that contribute to the 
estimate for a given small area. The borrowing can be from 
other time periods, from sample units outside the given 
small area, or from other variables measured on the same 
sample unit. Some examples follow. Most of these 
examples will allow some expansion of the range of small 
area estimates that can be produced from sample surveys 
with relatively large samples. They cannot magically 
convert small sample surveys into rich sources of small area 
data. 


1. In a monthly survey, it may be possible to combine
data for a small area over a period of consecutive 
months to produce direct estimates of a multi-month 
moving average for the area. For example, quarterly 
estimates may be possible where monthly ones were 
not. 


2. One may be ready to assume that means or proportions
estimated for a larger area apply equally to
a smaller component area within it. If the size of the 
small area is known, an estimate can be obtained by 
multiplying by the assumed mean or proportion. 
This assumption may be more realistically made 
within subgroups of the population (e.g., age
groups), rather than for the population as a whole. In 
this case, if the size of each sub-group is known for 
the small area, a synthetic estimator can be built up 
by multiplying the sizes by the assumed means and 
aggregating. 


3. If additional related variables are available from the
survey, more elaborate models may be set up relating 
the variable being estimated to these auxiliary 
variables. The parameters of the model may be 
estimated at a higher geographic level where there is 
sufficient sample to estimate them reliably. The 
model is then applied with the estimated parameters 
to the data for the given small area. 
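
The first two examples can be sketched in a few lines. The function names, the age groups, and all numerical values below are hypothetical. In particular, the moving average is computed here as a simple mean of monthly estimates, which is only one way of combining data over consecutive months.

```python
def moving_average(monthly_estimates, window=3):
    """Simple moving averages of consecutive monthly estimates for one area."""
    if window < 1 or window > len(monthly_estimates):
        raise ValueError("window must be between 1 and the number of months")
    return [
        sum(monthly_estimates[i:i + window]) / window
        for i in range(len(monthly_estimates) - window + 1)
    ]

def synthetic_estimate(subgroup_sizes, assumed_means):
    """Multiply each known subgroup size in the small area by the mean or
    proportion assumed for that subgroup (taken from a larger area) and aggregate."""
    return sum(size * assumed_means[group] for group, size in subgroup_sizes.items())

# Known subgroup sizes in a hypothetical small area.
sizes = {"15-24": 200, "25-54": 500, "55+": 300}
# Proportions estimated for the larger area containing it (assumed values).
rates = {"15-24": 0.12, "25-54": 0.06, "55+": 0.05}
total = synthetic_estimate(sizes, rates)            # 200*0.12 + 500*0.06 + 300*0.05
quarterly = moving_average([10.0, 20.0, 30.0, 40.0], window=3)
```

The synthetic estimate is only as good as the assumption that the larger-area proportions hold within each subgroup of the small area, which is why the subgroups should be chosen to make that assumption as plausible as possible.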


All of these approaches suffer from the lack of reliable 
baseline data for each small area. If such data are available,
for example from a recent census or from administrative
records, then the data may be used in combination to 
produce more reliable estimates than from either source 
alone. 


4.3 Combined Sources 


Methods that combine census or administrative infor- 
mation from the recent past with current sample survey data 
are borrowing strength from outside the survey. They still 
require model assumptions. However, these can often be 
weaker (since they involve assumptions about change from 
the benchmark, rather than about absolute levels of each 
small area) and so more acceptable, or more plausible, than 
in the case of sample survey data alone. 

A wide variety of estimation methods (which we won’t 
attempt to describe here) have been developed to handle 
this situation. Some of these methods can be thought of as 
estimating change since the most recent benchmark, others 
as distributing reliable current sample survey estimates 
among component small areas based on benchmark data, 
and yet others as recalibrating old benchmark figures to 
new current estimates. In essence, they all involve some 
kind of balancing of three kinds of estimates: (a) high 
variance but unbiased direct current survey estimates for the 
small area in question; (b) low variance current survey 
estimates for some surrounding or comparable larger area;
and (c) census-type estimates for the same small area from
recent administrative data, or a past census, which may 
contain unknown bias due to the source and the time lag. 
Any available auxiliary data can be incorporated to improve 
the accuracy of each component estimate. The way in which 
these three types of estimates are combined is determined 
by the choice of model and model parameters. 
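
The balancing of the three kinds of estimates can be illustrated with a weighted combination in which each component receives a weight inversely proportional to its assumed mean square error. This inverse-MSE rule is a simple stand-in chosen for illustration, not one of the model-based methods cited in this section; the function name and all numerical values are hypothetical.

```python
def combine_estimates(estimates, mses):
    """Combine component estimates with weights proportional to 1 / MSE.

    The MSE values are assumed to be supplied by the analyst; in practice
    they come from the chosen model and are themselves estimated.
    """
    if not estimates or len(estimates) != len(mses):
        raise ValueError("estimates and mses must be non-empty and of equal length")
    if any(m <= 0 for m in mses):
        raise ValueError("each MSE must be positive")
    inverse = [1.0 / m for m in mses]
    total = sum(inverse)
    weights = [v / total for v in inverse]
    combined = sum(w * e for w, e in zip(weights, estimates))
    return combined, weights

# (a) direct small area estimate: unbiased but high variance
# (b) estimate carried over from a larger area: low variance
# (c) benchmark figure from a past census: possibly biased by the time lag
components = [130.0, 100.0, 110.0]
assumed_mses = [400.0, 25.0, 100.0]
combined, weights = combine_estimates(components, assumed_mses)
```

With these assumed values the direct estimate receives the smallest weight, reflecting its high variance. A larger sample in the small area would lower its MSE and shift weight back toward the direct estimate.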

In summary, the methods of this and the previous section 
essentially reduce variance by making use of more data, but 
at the expense of introducing potential bias due to model 
assumptions that will never be exactly correct. It is very 
important to analyse the performance of these methods 
before their use, for example by carrying out the estimation 
process in a census year when direct estimates are available 
for comparison, and periodically thereafter. Model checking 
is becoming an area of increased research activity (Bayarri 
and Berger 2000). For more detailed descriptions of 
available methods in this class see, for example, Purcell and 
Kish (1979); Fay and Herriot (1979); Ghosh and Rao 
(1994); Singh et al. (1994); Schaible (1996); Rao (1999) 
and Gambino and Dick (2000). 


4.4 Rolling Censuses 


An innovative alternative to the census is being investi- 
gated in at least two countries. The method of producing 
small area data based on a large rolling sample has long 
been advocated by Leslie Kish as an alternative to the 
traditional census (Kish 1990, 1998). The sample survey 
“rolls” in the sense that over a long period (e.g., a decade) 
each of the smallest areas for which estimates are required 
would be included once in the sample so as to provide a 
direct estimate for that area once each period. Successively 
larger areas (aggregates of the smallest areas) would be 
represented more often in the sample, allowing either more 
reliable or more frequent estimates for those areas. For even 
larger areas, including provinces and the whole country, the 
accumulated sample would be sufficient to provide reliable 
annual, or more frequent, estimates at certain levels of 
detail. The approach may be considered with or without a 
periodic census to collect basic demographic data against 
which to calibrate the inter-censal survey estimates. 
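
The rolling idea can be illustrated with a small sketch, assuming a simple round-robin allocation of the smallest areas to the years of the cycle. The function names and the example areas are hypothetical, and an operational design would spread each larger area's components across years deliberately rather than by sorted position.

```python
def rolling_schedule(area_ids, period_years):
    """Assign each smallest area to exactly one year of the cycle (round-robin)."""
    if period_years < 1:
        raise ValueError("period_years must be at least 1")
    schedule = {year: [] for year in range(period_years)}
    for position, area in enumerate(sorted(area_ids)):
        schedule[position % period_years].append(area)
    return schedule


def years_covering(schedule, region_areas):
    """Years in which at least one component area of a larger region is sampled."""
    members = set(region_areas)
    return [year for year, areas in schedule.items() if members & set(areas)]


# Hypothetical example: 20 small areas on a ten-year cycle
areas = [f"A{i:02d}" for i in range(20)]
schedule = rolling_schedule(areas, 10)
region = areas[:8]  # a larger area made up of eight of the small areas
covered = years_covering(schedule, region)
```

Each small area appears in one year only, while the larger region is represented in several years of the cycle, which is what allows more frequent estimates for larger areas.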

The rolling census avoids the need for the assumption of 
models, but presumes that unbiased estimates of multi-year 
averages, or asynchronous estimates for different areas of 
the country, are satisfactory alternatives to the simultaneous 
point-in-time estimates of the traditional census. Relative 
cost is also a key factor, especially in the situation where a 
basic census is also carried out. On the other hand, by 
producing reliable annual estimates for many of the larger 
areas, and with much of the content detail of a census, this 
approach could effectively address the issue that census 
estimates can be up to 12 years old before the next ones 
appear. It also responds to mounting concerns over 
increasing difficulties and costs associated with the conduct 
of a traditional census. 

This approach is being tested in the United States under 
the name of the American Community Survey (Alexander 
1999, 2002) and in France where it is referred to as the 
“recensement continu” (Isnard 1999; Durr and Dumais 
2002). 


5. BUSINESS STATISTICS 


The problems of producing small area data for 
businesses are different in many important respects from 
those encountered for data on persons or households. 

Whereas the association of each individual with a “usual 
place of residence” is, for the vast majority of the popu- 
lation, a fairly clear and unambiguous concept (though 
perhaps becoming less clear with the growth of second 
residences, the incidence of prolonged absences away from 
the snow, and more flexible living arrangements), for 
businesses the question of where, geographically, to attri- 
bute various characteristics of a business is less clear in 
many situations. For single establishment businesses where 
all the activity takes place in a single location there is no 
conceptual problem, though there may still be a practical 
problem if the source of information is an administrative 
file that provides, say, an accountant’s address rather than 
the place of business. For some variables, such as 
employment, there may be no major conceptual problem 
even for larger businesses (except perhaps for those 
working in the transportation industry, or certain service 
industries). However, for variables such as revenues and 
profits there can be real questions about how these should 
be allocated geographically in multi-establishment 
businesses. The larger the geographic area the smaller the 
problem — location within a province doesn’t matter if one 
is only interested in provincial totals. But, in general, 
geographic attribution rules have to be determined before 
small area estimates for business activity can be considered, 
and for some aspects of business activity small area esti- 
mates may not make conceptual sense. 

While for household surveys the main obstacle to the 
production of small area estimates is sample size, for 
business surveys considerations of confidentiality usually 
constitute the major barrier. The smaller the area, the 
greater the chance that a particular industry will be domi- 
nated by one or a few major companies, thus precluding the 
provision of estimates for that area due to disclosure risk. 
Methods for checking statistical output on businesses to 
recognize potential disclosure risks are fairly well devel- 
oped (Federal Committee on Statistical Methodology 1994) 
but require constant attention on the part of the NSO. The 
confidentiality problem is less of an issue in those industries 
characterized by small units, which may be the same 
industries in which the conceptual problems of the previous 
paragraph are not so severe. In those industries, consider- 
ations of sample size may indeed be the limiting factor, in 
which case the families of methods described in the 
previous section are available. 
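
One commonly described check of this kind is a dominance, or (n, k), rule: a cell is treated as sensitive when its n largest contributors account for more than k percent of the cell total. The sketch below assumes n = 2 and k = 85 purely for illustration; the thresholds and the function name are hypothetical and are not those of any particular NSO.

```python
def is_dominated(contributions, n=2, k=85.0):
    """Return True if the n largest contributions exceed k percent of the total.

    contributions: non-negative values reported by the businesses in one cell
    n, k: illustrative thresholds only (assumed, not an official setting)
    """
    if any(c < 0 for c in contributions):
        raise ValueError("contributions must be non-negative")
    total = sum(contributions)
    if total <= 0:
        return False  # an empty or all-zero cell has nothing to disclose
    top = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top / total > k


# Hypothetical cells: one dominated by a single large company, one made up of small units
dominated_cell = is_dominated([900.0, 50.0, 30.0, 20.0])
small_units_cell = is_dominated([100.0] * 10)
```

The second cell shows why the problem recedes in industries characterized by small units: no small group of contributors comes close to the threshold.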

A third area of contrast with data on individuals, at least 
for countries that do not maintain a population register, is 
the existence of a relatively up-to-date list frame of 
businesses. This not only provides a base for sampling and 
a source of some auxiliary data for estimation, but also 
constitutes a potential source of direct estimates of business 
demography, at least annually. In many countries the 
currency of the business register is maintained by receiving 
transactions from the business tax system, which itself 
provides an annual census-like source of administrative data 
on business activity. However, use of tax data still requires 
careful consideration of the conceptual, geographical and 
confidentiality issues raised above. 


6. ENVIRONMENT STATISTICS 


Environment statistics provide yet different challenges 
for the production of small area statistics. While some 
environmental issues are national or even global in scope, 
many are by their nature local. Many sources of pollution 
are typically localized with their impacts being felt most 
severely in the neighbourhood of a plant or accident. The 
socio-economic impacts of broader environmental problems 
(e.g., loss of fish stocks) are frequently felt in small and 
often isolated resource-based communities. 

Some environment data are collected from households or 
individuals (e.g., recycling practices, fuel use) and their 
potential as a source of small area data is subject to the 
considerations already described in Section 4. Other 
environment data (e.g., waste generation, environmental 
protection expenditures, use of natural resources) come 
from businesses and would be governed by the consi- 
derations of Section 5. However, a great deal of environ- 
ment data is obtained from physical surveys (e.g., geolo- 
gical, physiographic, hydrographic), from instrument 
measurement (e.g., temperature, air quality, water quality, 
ozone layer thickness), and from direct observation (e.g., 
land use). Different considerations govern the relation of 
these data sources to small area data. 

Because environment data are no respecters of admi- 
nistrative boundaries, the need for a flexible geographic 
infrastructure, emphasised in Section 3, is especially 
important here. Small area geographic identification is 
needed to regroup data to geographical units that are more 
suitable for environmental analysis. For example, the pro- 
duction of waste attributable to a certain type of agricultural 
activity might be aggregated for all of the producers within 
a river basin. Environmental geographic units are either pre- 
defined (ecozones, drainage basins) or dictated by special 
events (areas covered with different thicknesses of ice, land 
areas flooded by heavy rains or spring thaws). In some 
cases, the area studied could be a very small site such as a 
park. 

Physical quantity or quality data can be difficult to 
aggregate or summarize. In some cases, point source data 
such as air quality measures cannot be considered repre- 
sentative of any larger geographic unit. Water quality may 
be summarized or compared by using an indicator, such as 
the number of days beaches are open for swimming, but not 
simply as an aggregate or average of water quality readings. 
For many measures, the focus of interest may be on change 
over time rather than small area comparisons. In other 
cases, sampling and estimation techniques may need to 
make use of spatial analysis techniques such as contouring 
or interpolation. 
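
As one simple illustration of interpolation, the sketch below uses inverse distance weighting to estimate a value at an unmeasured location from nearby point readings. This is only one of many possible spatial techniques and is chosen here for brevity; the function name, coordinates and readings are hypothetical.

```python
import math


def idw_interpolate(stations, x, y, power=2.0):
    """Inverse distance weighted estimate at (x, y).

    stations: list of (x, y, value) point measurements
    power: distance exponent; larger values give nearer stations more influence
    """
    if not stations:
        raise ValueError("at least one station is required")
    numerator = 0.0
    denominator = 0.0
    for sx, sy, value in stations:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            return value  # the target coincides with a measurement point
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


# Hypothetical readings at two monitoring stations
readings = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
midpoint_value = idw_interpolate(readings, 1.0, 0.0)
```

The interpolated value always lies between the smallest and largest readings, so a method of this kind cannot recover a local peak that no station observed.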

The privacy and confidentiality concerns associated with 
environment data depend on their source. Data collected 
from households or businesses, even if they involve 
physical measurements, are protected by the same confiden- 
tiality rules as other data from those sources. Direct mea- 
surements of the stock of natural resources or the quality of 
the environment do not raise these concerns. Cartographic 
representation of spatial patterns may be one way to over- 
come some of the analytical frustrations of data suppression 
for small areas. Choropleth maps (maps which show the 
distribution of variables or characteristics by using colour 
or shading for ranges of the distribution) can explicitly 
represent the ranges implicit in rows or columns that would 
be suppressed in a published table. 

Cross-border pollutant flows and their global effects 
make physical environment data an international issue. 
Cooperation between neighbouring countries is necessary 
to ensure that national boundaries do not impede analysis of 
the impact of physical processes that recognize no such 
boundaries. 

In summary, the small area dimension is particularly 
important for environment data, not only because a locality 
is frequently the point of interest, but also because data 
must often be reaggregated to geographic areas more appro- 
priate for environmental analysis such as ecozones or water- 
sheds. 


7. ORGANIZATION AND DISSEMINATION ISSUES 


Most NSOs are organized by subject-matter area. The 
production of small area estimates cuts across subject- 
matter areas, but requires support from Geography staff for 
geographic infrastructure, from Methodology staff for esti- 
mation and evaluation methods, and perhaps from other 
staff for analyzing and packaging data across subject areas. 
The question of how to organize small area estimation 
within an NSO therefore arises. 

Requiring subject-matter areas to manage small area 
estimation in their areas, with support from methodology 
and geography staff as needed, is a natural choice since they 
should be most in touch with the data requirements and data 
limitations in their subject areas. More of an issue is how to 
package data for small areas for dissemination to users. 
Who should be responsible for pulling together data from 
different subject-matter areas for a particular small area? 
Should this be a regular program, or something that is done 
‘on demand’? Here there are different models to choose 
from — and Statistics Canada has tried most of them over 
the years. 

At some periods in the past a division focussing on 
regional or urban statistics has existed to provide a regional 
focus for statistical data. At times, the census program, 
which is of course the richest source of small area data, has 
spearheaded the production of small area data profiles. At 
other times, an inter-divisional project has been used to 
manage a program of profiles for electoral districts or for 
other geographic areas. At the same time, regional office 
staff have played a key role in pulling together information 
for small areas in response to client requests. None of these 
arrangements has been ideal. The production of profiles has 
typically been a labour-intensive task requiring a broad 
subject-matter understanding and a lot of searching and 
manipulation of data. Despite the existence of standard 
geographic areas, the combination of data based on several 
different geographic bases is usually an issue. Ensuring that 
data for a large number of small areas are properly matched 
and collated can be an arduous quality assurance challenge. 

Pre-planned profiles on paper were never overly 
successful. As a result, a strategy of maximizing respon- 
siveness to client demands as they arose was preferred. 
With recent advances in technology, and broader coverage 
of small area data in the corporate database, a more auto- 
mated approach is possible. A component of the Statistics 
Canada website (www.statcan.ca), called Community 
Profiles, and largely based on Census of Population data, is 
our most recent attempt to make small area data more 
accessible and promises to be a precursor of future 
directions in this field. Some health data for health districts 
are already included, and certain other non-census sources 
of community data are under consideration. 


8. CONCLUSIONS 


The production of small area statistics by an NSO raises 
issues that are qualitatively different from those faced in its 
regular production of national, provincial or other large area 
data. The statistical theory that makes data based on 
sufficient individual measurements inherently reliable for 
large areas (ignoring bias for the moment) begins to break 
down for smaller areas. Unless a current census or admi- 
nistrative source with full coverage is available, this means 
that the NSO has to resort to some model-based help in 
order to provide estimates. Since alternative models can 
produce different estimates, a degree of arbitrariness is 
introduced into estimates, and this may be seen by some as 
undermining the objectivity of an NSO and its methods. The 
fundamental principle of openness and transparency about 
methods, including the choice of any models used and the 
impact of different assumptions, takes on even greater 
importance in the domain of small area estimation. 

On top of this, an NSO should expect that small area 
estimates will come under more focused scrutiny than do 
many large area estimates. Though large area estimates 
receive broader attention, few individuals have the capacity 
to confirm or refute an estimate at the national level. But at 
the local level there will be many who think they know 
what is going on in their town. And typically small area 
estimation does not work uniformly well for all areas. The 
argument that a method works well on average will not 
quell criticism from those areas where it has not worked 
well — unless it has also worked to the local advantage! The 
NSO has to be prepared for the double jeopardy of weaker 
estimates under closer scrutiny. 

If that is not enough already, confidentiality considera- 
tions loom larger at the small area level. The very fact that 
estimates are being produced for local areas highlights the 
potential for identification of individuals even though the 
NSO has taken sufficient precautions to prevent such 
disclosure. Some users of small area data for marketing 
purposes do not help the situation by implying in their 
advertising that they can target mail to households based on 
individual or household characteristics, when they are 
actually using small area data to distinguish neighbour- 
hoods. Some methods of small area estimation require 
record linkage which may also raise privacy concerns. 
Again a policy of openness and careful review of all such 
applications, at a senior level and before they begin, is 
necessary to ensure that the public benefit outweighs any 
privacy invasion. 

Despite these potential difficulties, the demand for small 
area data remains high, technology offers new approaches 
to the management and dissemination of small area data, 
and methodological work on small area estimation is an 
active research area among statisticians. While small area 
data will generally not be an NSO’s first priority, the 
relevance of its statistical programs will be magnified many 
times if it is able to cater to the most important small area 
data needs. 


The Importance of a Quality Culture 


DENNIS TREWIN¹ 


ABSTRACT 


The reputation of a national statistical office (NSO) depends very much on the quality of the service it provides. Quality 
has to be a core value: providing a high quality service has to be the natural way of doing business. It has to be embedded 
in the culture of the NSO. 

The paper will outline what is meant by a high quality statistical service. It will also explore those factors that are important 
to ensuring a quality culture in an NSO. In particular, it will outline the activities and experiences of the Australian Bureau 
of Statistics in maintaining a quality culture. 


KEY WORDS: Continuous quality improvement; National Statistical Office. 


1. INTRODUCTION 


Fellegi (1996) provides a strong argument that the trust 
in the national statistical agency is how most users judge the 
quality of its statistical products. 


“Credibility plays a basic role in determining the 
value to users of the special commodity called 
statistical information. Indeed, few users can validate 
directly the data released by statistical offices. They 
must rely on the reputation of the provider of the 
information. Since information that is not believed is 
useless, it follows that the intrinsic value and 
usability of information depends directly on the 
credibility of the statistical system. That credibility 
could be challenged at any time on two primary 
grounds: because the statistics are based on 
inappropriate methodology, or because the office is 
suspected of political biases.” 


Trust will not happen unless the culture is right. Culture 
is a word with many meanings but I am interpreting culture 
as “the way we do things”. Core values are important to 
this. They cannot be just statements hanging on the wall. 
They have to be understood. They have to be reflected in 
behaviours, particularly by leaders of organizations. 

The Australian Bureau of Statistics (ABS) places great 
reliance on adherence to its core values. More than any- 
thing, they distinguish us from other survey providers in 
Australia. The core values are: 


— Relevance — regular contact with those with policy 
influence, good statistical planning, which requires 
a keen understanding of the current and future 
needs for statistics, are essential, as is the need for 
statistics to be timely and relatable to other 
statistics. 

— Integrity — our data, analysis and interpretation 
should always be objective and we should publish 
statistics from all collections. Our statistical system 
is open to scrutiny, based on sound statistical 
principles and practices. 


— Access for all — our statistics are for the benefit of 
all Australians and we ensure that equal opportunity 
of access to statistics is enjoyed by all users. 


— Professionalism — the integrity of our statistics is 
built on our professional and ethical standards. We 
exercise the highest professional standards in all 
aspects of ABS statistics. 


— Trust of providers — we have a compact with 
respondents; they are encouraged to provide us with 
accurate information and we ensure that the 
confidentiality of the data provided is strictly 
protected. We keep the load and intrusion on 
respondents to a minimum, consistent with meeting 
justified statistical requirements. 


Adherence to core values is just one element of 
maintaining a quality culture. Part 2 discusses the key steps 
the ABS uses to maintain a quality culture. 

It is now widely recognized that quality is much more 
than accuracy (e.g., Brackstone 1999 and Carson 2000). In 
Part 3, the different dimensions of quality are discussed 
before identifying in Part 4 what I think are some of the 
major quality challenges for the ABS over the medium 
term. Many of these will be shared by other national 
statistical organizations. 


2. TOWARDS A HIGH QUALITY STATISTICAL 
SERVICE 


Quality assurance is a responsibility of all staff in the 
ABS. There is no central “quality management” group 
although Methodology Division is encouraged to be our 
conscience on quality issues — a role it takes on with 
enthusiasm, sometimes to the annoyance of others. However, 
that is a good sign: they are provoking debate on 
some of the more difficult quality issues. Support from 
senior management for this type of role is very important. 

The key strategies for ensuring a high quality are 
described under six broad headings. 


— A high degree of credibility for the ABS and its 
outputs. 


— Maintaining the relevance of ABS outputs. 
— Effective relationships with respondents. 
— Processes that produce high quality outputs. 


— Regular review and evaluation of statistical 
activities. 


— Staff who are skilled and motivated to assure the 
quality of ABS outputs. 


2.1 A High Degree of Credibility 


Credibility is fundamental to the effective use of official 
statistics. Credibility arises from a system of statistics which 
provides an objective window upon the condition of a 
nation’s economy and society. 

The legislative framework within which the ABS 
operates is an important pre-condition for the integrity of 
Australia’s official statistics. The Australian Statistician 
(i.e., the chief executive of the ABS) is guaranteed 
considerable independence by law. This helps ensure that 
the ABS is, and is seen to be, impartial and free from 
political interference. In particular, the independence of the 
Statistician supports his objectivity in determining the 
statistical work program and determining what statistics are 
published. Although the legal authority is there, it still 
needs to be reflected in the way senior staff behave. 

Government statisticians must not just apply professional 
skills to their work; they must also be seen to adhere 
to high ethical standards, especially with respect to objec- 
tivity and integrity. We are frank and open when describing 
our statistical methods to users; we publish information 
about our performance — for example, in terms of both 
sampling and non-sampling errors, and revision histories 
for key series; we are willing and able to identify and 
address user concerns regarding quality; we are receptive to 
objective criticism and prepared to respond quickly even if 
the problem is one of perception rather than reality. We 
promote good relationships with the media as they have a 
major influence on public opinion of the ABS and its 
outputs. Also, most Australians find out about official 
statistics through the media. We engage in other user 
education activities aimed at fostering intelligent use of 
official statistics. 

The fact and perception of ABS objectivity are rein- 
forced by our policies of pre-announcing publication dates 
for main economic indicators, allowing very limited pre- 
release of publications (the details of which are in the 
public domain), and making special data services available 
on an even-handed basis to all. 


2.2 Maintaining the Relevance of ABS Outputs 


There can be, of course, tension between (on the one 
hand) being responsive to changing policy needs and (on 
the other) maintaining the continuity of a system of 
statistics that can objectively monitor performance. Senior 
staff of the ABS devote a great deal of attention to 
maintaining personal contact with key users, to gather 
intelligence about policy issues and emerging areas of 
economic, social and environmental concern. This includes 
regular meetings with the most senior staff of the 
government agencies responsible for policy. The Directors 
of our State offices have similar arrangements with State 
officials. That intelligence feeds into strategic planning and 
the reviews of national statistical programs. 

The ABS has a range of other means for communicating 
with the users of statistics, to ensure that our products are 
relevant to their needs. For example, advisory groups 
representing users and experts in various fields provide 
valuable guidance to our statistical activities. 

There may also be some tensions or trade-offs between 
the different aspects of quality. The ABS positions itself at 
the higher accuracy end of the information market, to 
protect the valuable ABS “brand name”. But if, for 
example, there is an urgent demand for data in a new field, 
some aspects of quality may be traded off in order to 
achieve timeliness and relevance. Nevertheless, there is a 
“bar” below which we will not go. Because it is probable 
that the new statistics will be used to inform significant 
decisions or debate, the ABS makes very clear statements 
about the accuracy of the data to help users understand how 
they can be used. On occasion, such new statistics may be 
differentiated from our other products by labelling them 
“experimental” or releasing as an information or occasional 
paper, rather than a standard publication. We regard this 
form of branding as very important to reliable interpretation 
of our statistics. 


2.3 Effective Relationships with Respondents 


An official statistical agency must maintain good 
relations with respondents, especially trust, if it wants them 
to co-operate and provide high quality data. The ABS 
approach includes — explaining the importance of the data 
to government policy, business decisions and public debate; 
a policy of thoroughly testing all forms before they are used 
in an actual survey; obtaining the support of key stake- 
holders; minimizing the load placed on respondents parti- 
cularly by using administrative data where possible; and 
carefully protecting privacy and confidentiality. 

The ABS monitors and manages the load it imposes on 
both households and businesses; we have developed 
‘respondent charters’ for both groups. As well, a Statistical 
Clearing House has been set up within the ABS to 
coordinate surveys of businesses across government 
agencies (including the ABS), to reduce duplication and to 
ensure that statistics of reasonable quality are produced. 

All ABS forms and collection methods are tested to 
ensure that the data we seek are available at reasonable cost 
to respondents, and the best available methods are used to 
collect them. For business surveys, our units model, 
classifications and data items are designed to be as consistent as 
possible with the way businesses operate. This now 
corresponds closely with their reporting for taxation purposes, 
making it easier to integrate survey data with data collected 
for taxation purposes. For household surveys, the extensive 
use of cognitive testing tools within the ABS, and the esta- 
blishment of a questionnaire testing laboratory, have helped 
to improve quality and to reduce respondent load. 
Standards for form design and form evaluation are set out 
in manuals and are promoted and supported by experts in 
form design. 

The ABS uses efficient survey designs to minimize 
sample sizes to achieve a specified level of accuracy, and 
hence total reporting load; we also control selection across 
collections to spread the load more equitably. To take 
advantage of current reforms of the Australian taxation 
system, the ABS is seeking every opportunity to improve 
the efficiency of our sample designs, through the use of 
taxation data as benchmarks, as well as using it as a 
substitute for some of the data now gathered through direct 
collections. We have changed the business unit structure 
used in our surveys to make it consistent with the structure 
used for taxation purposes. 
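
The link between a specified level of accuracy and sample size can be illustrated with the textbook formula for simple random sampling, including the finite population correction. This is a simplified sketch and not the ABS design procedure, which the text does not describe; the function name and the example figures are hypothetical.

```python
import math


def srs_sample_size(population_size, std_dev, margin, z=1.96):
    """Smallest simple random sample meeting a margin of error for a mean.

    population_size: number of units on the frame
    std_dev: assumed standard deviation of the survey variable
    margin: half-width of the desired confidence interval
    z: normal quantile (1.96 corresponds to roughly 95 percent confidence)
    """
    if population_size < 1 or std_dev <= 0 or margin <= 0:
        raise ValueError("inputs must be positive")
    n0 = (z * std_dev / margin) ** 2        # size ignoring the finite population
    n = n0 / (1.0 + n0 / population_size)   # finite population correction
    return min(population_size, math.ceil(n))


# Hypothetical frame of 10,000 units, standard deviation 50, margin of error 5
required = srs_sample_size(10_000, 50.0, 5.0)
```

Halving the margin of error roughly quadruples the required sample, which is why better auxiliary information, such as taxation data used as benchmarks, can reduce reporting load for a given accuracy target.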

For household surveys, the introduction of computer 
assisted interviewing has helped to streamline interviewing 
procedures, reduce respondent load, and improve the 
quality of data collected. 


2.4 Processes that Produce High Quality Outputs 


The quality of ABS statistics is underwritten by the 
application of good statistical methods during all stages of 
a collection including the design stages. The ABS has a 
relatively large Methodology Division (about 120 staff) 
which reports directly to the Australian Statistician. The 
Division is responsible for ensuring that sound and defen- 
sible methods are applied to all collections and compila- 
tions. The Methodological Advisory Committee, a group of 
academic experts, provides independent reviews of our 
statistical methods. 

The ABS puts substantial effort into developing statis- 
tical standards, including concepts, data item definitions, 
classifications, and question modules. All ABS surveys 
must use these standards. The standards are supported by 
relevant data management facilities to ensure they are 
accessible and to make it easier to use standard rather than 
non-standard approaches. 

Sample design and estimation methods are the responsi- 
bility of the Methodology Division. Where possible, a “total 
survey design” is used — accuracy requirements are set 
according to the intended use of the data, and accuracy is 
measured in terms of both sampling and non-sampling 
errors. For example, in business surveys total survey design 
guides the allocation of resources to the intensive follow up 
of non-respondents or the editing of questionnaires; the 
effort for reducing non-sampling errors is optimized 
according to the impact of errors on overall quality. The 
cost to data providers is also taken into consideration. The 
“total survey design” has to be approved by a senior ABS 
committee before it is implemented. 

In recent years, the ABS has made substantial progress 
by applying standardized best practice across surveys. For 
example, business surveys based on the business register 
now draw their frames at a common date each quarter, and 
use a common estimation method to ensure all collections 
have a consistent and complete coverage. Standard rules are 
adopted for frame maintenance, field collection and estima- 
tion, and generalized processing facilities are available to 
support the use of these rules. Standard methods are used to 
allow for “new businesses” not yet included on the survey 
frame. The ABS is thereby able to increase the coherence 
of estimates across different business surveys. 

For household surveys, a master sample system has been 
in use since the mid-1960s. The system is updated 
regularly after each five-yearly census, and has been the 
cornerstone for ensuring the accuracy of statistics collected 
from household surveys. 

Achieving quality in surveys is easier when computer 
systems support current best practice. The ABS has 
invested in generalized tools. They have been developed for 
all major processing steps of both business and household 
surveys, including sample frame management, data input 
and editing, imputation, estimation and aggregation. 

The ABS embraces a rigorous continuous quality impro- 
vement approach wherever appropriate. The Australian 
Population Census is a classic example of raising quality 
through a strategy of measuring quality and involving all 
staff in examining and devising solutions to quality 
problems. This approach was applied very effectively at the 
data processing centre for the 1996 and 2001 Censuses. In 
both cases, the centre achieved significant budget savings, 
better quality and an improvement in timeliness. Contin- 
uous quality improvement is also applied to the coding of 
businesses on the business register, and to many other ABS 
processes. 

At the output end of collections, each subject group is 
required to confront its data with other ABS data and with 
external information, to ensure the coherence of our 
statistics. The key macroeconomic data have to be “signed 
off” by the national accountants in meetings established 
especially for the purpose of clearing the statistics. The 
national accountants then have an obligation to use these data 
without further adjustment in the compilation of the 
accounts, enhancing consistency between the national 
accounts and source data collections. More generally, 
confrontation of different data sources is undertaken by our 
national accountants through use of an ‘input-output 
approach’ to compiling national accounts estimates. The 
new methodology has led to more consistent accounts. 
Furthermore, the data confrontation and balancing process 
at detailed levels has helped to identify data deficiencies. 
Information about quality is fed back to the economic 
collection groups and is resulting in a more focused 
approach to improvements in the quality of source data. 

One important quality improvement initiative that the 
ABS has pursued is the development of an Information 
Warehouse to manage and store all of our publishable data. 
By drawing together different datasets into a single data- 
base, the Warehouse enables our statisticians to confront 
statistics produced from different collections. Furthermore, 
all forms of publication, be they paper based or electronic, 
are to be produced from a single data store, with the objec- 
tive of ensuring that the same data released in different 
products, and at different times, are consistent. 

Another important element of quality management is 
documentation. Good documentation supports review acti- 
vity and facilitates the dissemination of quality information 
to users, so they can assess the fitness of the data for the 
purposes they have in mind. As part of the Information 
Warehouse initiative, the ABS can now enforce standards 
for documentation of the metadata that describe concepts, 
definitions, classifications and quality. 

A relevant and responsive statistical service must do 
more than provide data to clients. The ABS has recently 
strengthened its analytical ability. A team of analysts has 
been set up to develop new measures of socioeconomic 
concepts, to explore relationships between variables and to 
prototype new analytical products. The expanded program 
of analysis work is expected to deliver significant benefits 
in the form of insights into data gaps and quality concerns. 


2.5 Review and Evaluation of Statistical Activities 


Each ABS area is responsible for continuous quality 
review and improvement. For statistical collection areas, 
quality management is supported by sets of performance 
indicators. A standard set of measures has been developed 
to permit a comparison of quality across collections. Tools 
are now being developed to calculate these measures as part 
of our normal survey processes, and the Information 
Warehouse will allow us to store and display the measures. 
The key indicators are also included in the annual reports 
each Branch makes to the ABS Executive for review. 

Quality measures are of interest to the users of statistics. 
The Information Warehouse will improve users’ access to 
information about quality issues. As well, the ABS places 
high priority on helping users understand the quality of data 
and its implications for them, and has adopted active 
education strategies to promote such understanding. As 
highlighted in Lee and Allen (2001), there is much to do to 
improve user understanding of quality. 

Each ABS household survey now includes an evaluation 
program which reviews the effectiveness and efficiency of 
all survey activities and assesses the extent to which the 
data are used by clients. The Statistical Clearing House 
conducts a review of each ABS business survey. These 
initiatives ensure that all collections are subjected to at least 
a basic evaluation, and bring to light opportunities for 
improvements to quality and efficiency. 

As well as making internal comparisons of performance 
across its own collection areas, the ABS has established a 
benchmarking network with overseas statistical agencies; 
the aim of the network is to share information about survey 
design, processes and costs. The benchmarking exercise is 
providing very useful guidance to the ABS’s efforts to 
improve its processes and outputs. 


2.6 Skilled and Motivated Staff 


The ABS could not provide high quality information to 
its user community if it did not employ people who bring 
skills and energy to our statistical work. The staff are 
responsible for implementing the strategies discussed 
above. They must take a professional approach and be 
committed to the development of new methods, to conti- 
nuous quality improvement, and to the open discussion of 
methods and quality issues. 

Quality improvement and on-going statistical work 
compete for the time and energies of our staff. The ABS 
approach is, as far as possible, to integrate quality work 
with on-going processes and systems. We emphasize to 
staff that quality management is a corporate priority and 
ensure that tools and resources are made available to 
support it. In particular, the ABS is implementing a tighter 
approach to project management; this is being supported by 
manuals, systems and training. 

Statistical training plays an important role in maintaining 
and improving quality. The ABS is always searching for 
new, more effective, approaches to skills development. An 
important element of our performance management system 
is a focus on identifying and addressing individuals’ deve- 
lopment needs. 

Relationships with other national and statistical agencies 
are a very important element of the ABS’s efforts to 
improve official statistics. The ABS is committed to using 
international standards; we take advantage of the wide 
range of expertise embodied in those standards. On the 
other hand, there is an obligation for us to make a positive 
contribution to the development of the standards. In doing 
so, we try to take account of the interests of the Asia/Pacific 
region as well as those of Australia. With ever increasing 
globalization of economic activity and the pursuit of world 
wide social goals, the compatibility between Australian 
statistics and those of other countries is an important 
element of quality. The ABS maintains strong links with 
many overseas agencies. We are fortunate that there is a lot 
in common in the challenges we face and there are great 
benefits from sharing experiences with other statistical 
agencies. 


3. DIMENSIONS OF QUALITY 


Figure 1 is taken from Lee and Allen (2001). Among 
other things, it neatly summarizes, on the left hand side, 
three existing frameworks for judging quality. There are 
some differences in the descriptors used but basically 
they are providing the same message — there is much more 
to quality than accuracy. This is now widely accepted 
although it was not so long ago that discussion of the 
quality of a statistic focussed on its accuracy and the 
sampling variability in particular. 

There are several messages in the right hand side of Figure 1: 

(i) There are many different ways of compiling official 
statistics, from modelled data/analytical outputs to 
censuses and sample surveys. In Australia we are 
making greater use of administrative data, systems 
of accounts (linked to the national accounts) and 
model based and other analytical methods to 
produce statistical outputs, compared with five 
years ago. The quality challenges differ between the 
different means of compiling statistics. 

(ii) There are several groups of activities associated 
with statistical outputs, from “frameworks, 
concepts, standards and classifications” through to 
“services/dissemination”. Each is important in its 
own right and has its own quality challenge. 

(iii) The performance of a National Statistical Office is 
extremely important to its quality image as 
recognized in the opening quote of the paper. A 
number of the elements are specified in Figure 1. 
All are important. Indeed you cannot have a high 
performing statistical office unless you rate well 
against each of these elements, including 
management and financial performance. 

(iv) There are other elements such as institutional 
settings (e.g., legislation) which are also important. 


The main purpose in describing the above is to emphasize 
that the list of quality challenges for a national statistical 
office is very large. All have to be tackled in some way — 
this would not be possible unless you have a quality culture, 
i.e., attention to quality is the responsibility of all staff. 
There are many “moments of truth” to genuinely test 
whether a quality culture exists or not. 


4. CURRENT QUALITY CHALLENGES AT ABS 


Psychologists say that it is difficult to grasp more than 
seven points at one time so the remainder of the paper is 
limited to identifying seven major quality challenges for the 
ABS. 

Figure 1. A Framework for Assessing Quality 

(i) The increasing use of large, but imperfect, 
administrative and transactional data bases for 
compiling official statistics. 


(ii) Increasing user expectations raising the quality 
‘bar’. 


(iii) Managing the tension between improving business 
processes (which can mean removing those 
responsible for statistical outputs from direct 
involvement with input processes) and maintaining 
or improving the quality of statistical outputs. 


(iv) Quality assurance on electronic outputs. 


(v) The presentation of statistics on the internet, 
including the need to educate the user community 
on quality of official statistics. 


(vi) Managing the transfer of knowledge and skills with 
an ageing senior management team, many of whom 
will retire over the next 5 years. 


(vii) Use of international statistical standards to maintain 
comparability where the standard may not be the 
most appropriate for national statistics. 


4.1 Increasing Use of Administrative/Transactional 
Data Bases 


We have used administrative data bases for many years 
(e.g., vital registrations for births and deaths, customs for 
trade data) to compile official statistics. Others have been 
used to develop frameworks for statistical collections. The 
issues at hand are the increasing availability of these data 
bases, their under-utilization for statistical purposes, and 
taking advantage of the potential to link across data bases 
and ABS collected data sets using a common identifier 
(e.g., the Australian Business Number for business 
statistics). 

Examples of administrative data bases that are becoming 
available are extended personal and business income tax 
data bases, health insurance transactions, and details of 
those on income support. 

Transactional data bases are becoming available, 
although not in readily accessible form. Data bases of 
particular interest to the ABS are scanner data bases from 
retail outlets and eftpos (i.e., electronic fund transfers 
between customers and retailers) data bases. 

There are some particular advantages in using admi- 
nistrative or transactional data bases: 


— they reduce the compliance cost we impose on 
respondents 


— they are often “censuses” and therefore provide 
scope for producing detailed data sets (e.g., by 
geography) 

— they often have a longitudinal element (e.g., tax 
data) to support this form of analysis 


— they often contain an identifier which facilitates 
analysis across data sets (e.g., the Australian 
Business Number will facilitate analysis across 
business tax data sets, customs data, and ABS 
surveys) 


— they might be cheaper than directly collected data 
sets. 


There are negatives of course — for example, the defini- 
tions may not be consistent with the preferred statistical 
concepts; less attention may have been given to incoming 
quality; and they may be out of date. Managing privacy 
aspects is a particularly important element. Although our 
motives are entirely honourable, and are in the public 
interest, matching data bases is a sensitive issue that we 
ignore at our peril. Many of our users, particularly those in the 
academic community, are not as sensitive to these concerns. 

There is also the question of whether the statistical outputs 
should be produced by the ABS or by the agency responsible for 
the data sets. A number of issues come into consideration 
— the importance of the outputs to the national statistical 
service, costs, the extent to which quality can be managed 
and the basic question of whether the administrative agency 
is prepared to give up custodianship. Only the most 
important data sets will be brought into the ABS for 
compiling official statistics; for the others, we will work 
with the administrative agency to help them deliver “fit for 
purpose” statistical outputs into the public domain. 

What have been our key responses to this important 
quality management issue? 


— We are developing protocols for the publication and 
management of data from administrative sources. 
Associated with this is the promotion and support of 
good statistical and data management practices. 


— For each statistical field, we are preparing infor- 
mation development plans in conjunction with other 
stakeholders which identify those areas of greatest 
importance and set out specific activities which will 
lead to increased availability of non-ABS data, 
particularly quality management issues. 


— We are actively promoting good practice in information 
management. 


— A major investment project has been the greater 
utilization of taxation data to provide cost-effective 
statistics. 


— We are investigating methods for assuring the 
quality of the very large but imperfect data sets that 
are available through administrative and trans- 
actional data holdings. 


4.2 Increasing User Expectations 


User expectations on quality are changing: they are 
much higher than they were as recently as 5-10 years 
ago. This trend is likely to continue. The increasing 
globalization of financial markets will mean that key 
macroeconomic statistics have international, as well as 
national, prominence. 

There is a perception that statistics have become more 
volatile. In some cases they have, because the underlying 
phenomenon has become more volatile. However, we do 
not believe statistical measurement methods are a significant 
contributing factor; in most cases methodological 
developments have led to improvements, although the 
perception may be different. For example, the volatility in 
the key national accounts series is considerably less than 
it was 10-15 years ago, yet this is quite different from the 
perception of some users. 

We also receive more criticism of inaccuracies in very 
detailed data (e.g., Population Census tables) than 
previously. Again, it is not that the quality is deteriorating 
— it is that the expectation is higher. 

We have to accept that “the bar is rising” and do what we 
can to improve quality to the expected level. That is not 
always possible of course so managing expectations is 
important. This can be done by: 


— providing good explanations of the strengths and 
weaknesses of particular data sets; 


— talking to key users whenever possible about the 
strengths and weaknesses of data series; 


— responding to their informed criticism and seeking 
partnerships in improving quality (e.g., in our 
detailed foreign trade statistics we openly seek 
feedback from users on the quality of the statistics); 
and 


— providing as much explanation as possible for 
statistics that might seem unusual or different to 
expectation. 


4.3 Improving Business Processes 


Like several statistical organizations, the ABS is looking 
at how it might use new technologies, and other elements 
such as increased access to taxation data, to improve the 
efficiency of its business statistics processes. 

We are also investigating the business processes asso- 
ciated with household surveys, particularly as increased use 
is made of computer assisted interviewing (CAI). However, 
in this section the paper will concentrate on the changes we 
are making to the way we manage business statistics to 
describe this particular quality challenge. 

A team was set up to look at the possibilities. As a 
consequence, a number of significant changes were agreed 
to, to be known collectively as the Business Statistics 
Innovation Program. We are looking at revised business 
processes that will be in place for at least 10 years and will 
yield a significant return on the investment required to set 
them up. We will: 

— extend the responsibilities of the Business Register 
Unit to capture and store taxation data with a direct 
link to the Business Register through the Australian 
Business Number (ABN). The ABN is now allocated 
through the taxation registration scheme and 
is available with most business transaction data 
bases. The data will be stored in such a way that they can 
be used by the various ABS statistical areas to compile 
statistics directly from taxation data or in combination 
with ABS survey data; 


— improve the way we manage business respondents 
— this will include some preference in how they 
provide data to us; 


— set up an input data warehouse, with the Australian 
Business Number as the link across the various data 
sets; 


— establish a business statistics processing environ- 
ment based around the input data warehouse; and 


— increase centralization of a number of the functions 
associated with compiling business statistics. 


We can see the positives in these developments: more 
efficient delivery of business statistics, enhanced use of 
taxation data and other administrative data, and data bases 
that support a wider range of statistical analysis. However, 
they will reduce the level of contact that statistical output 
areas have with their input data sources. What impact will that 
have on quality? What strategies can we deploy to mitigate 
the impact? These are important questions that we will have 
to answer. This is the main risk we will have to manage in 
implementing the Business Statistics Innovation Program. 


4.4 Quality Assurance on Electronic Outputs 


Great care is taken over the quality of our paper products. 
This has been built on many years of experience. Our 
record is good and the quality assurance processes are well 
embedded in the way we go about our business. Yet more 
and more of our user community receive their data in 
electronic form only. They will make analyses based on these 
outputs, often leading to important decisions being made. It is 
just as embarrassing to us to have errors in electronic outputs 
as to have them in paper outputs. 

Our quality assurance procedures for electronic outputs 
are not as sophisticated, but they are evolving. The key 
responses have been as follows: 


— Our data warehouse supports the storage of all the 
objects associated with the dissemination of a 
particular set of statistics, including data cubes and 
metadata. 


— Statistical areas are asked to approve each object — 
they are individually developing their own 
techniques for quality assurance (but sharing ideas 
on best practice). 


— A publishing system has been developed to support 
the simultaneous release of all outputs. If they are 
delivered from the same set of objects, there is less 
chance of inconsistency between the outputs. 


4.5 The Presentation of Statistics on the Internet 


Ultimately only the user can make judgements about the 
fitness of a statistical output for their purposes. These vary, 
of course, and what might be fit for one purpose may not be 
for another. There is an obligation on us to provide a range 
of supporting information on data outputs, including that on 
quality, so that the statistical users can make their own 
judgements on fitness for use. There are a number of 
existing, well proven practices relating to declarations about 
the quality of statistics. These activities are now a routine 
part of existing dissemination practices. They include: 


— Concepts, Sources and Methods publications that 
describe in detail the methods used to compile 
major statistical outputs. These are available on our 
web site as well as on other media. 


— An assortment of Information and Working Papers, 
and feature articles in publications, which are used 
to draw attention to issues specific to particular 
outputs or changes that are being made to their 
compilation methods. 


— A policy of “no surprises” when there are 
significant changes to the methods used for the 
compilation of statistical series. As well as 
Information Papers etc., if there are important 
changes to statistical series, we embark on a 
program of seminars and bilateral discussions with 
key users to explain the changes and the reasons for 
them. 


— Material on methods is included in all our 
publications. The ordering and physical 
presentation of this information is according to 
agreed standards. These were developed following 
research undertaken for us by a communications 
consultant on how our users use the material in 
statistical publications. 


— The analysis section of our publications includes 
material that explains, among other things, large or 
unusual movements in our statistical series. Often 
this will be based on information that is only 
available to ABS staff through their contact with 
respondents or their intimate knowledge of the 
methods used in compiling statistics. Our User 
Groups have advised that this is one of the most 
valuable forms of analysis that we can undertake. 


We believe that our key users have a reasonable 
understanding of the quality of the statistics they use. 
However, the increased reliance on electronic dissemination 
poses new challenges. In one sense this move provides a 
wonderful opportunity to present a range of information on 
quality that is easily accessible through a few well-designed 
“clicks”. But because information about the quality of the 
statistics is “not in your face”, as it can be in hard copy 
publications, it is easier for users to miss the key messages 
that we are trying to convey. The real challenge for us is 
to develop methods for presenting quality in a way that 
makes it hard for users to miss the main messages we want to 
convey. 

One means of doing this may be to provide separate 
messages that draw attention to particular information we 
want to transmit on quality. These could be automatically 
activated as particular statistical series are accessed or could 
be delivered by a separate email message. Research is 
required into the most effective means. 

Lee and Allen (2001) have described some of our 
research work to date on this issue. The work is still at the 
exploratory stage. Things that are being investigated are: 


— Usability testing of how users prefer to access 
information on quality. 


— Showing leadership and developing user education 
programs on how to use information on quality. A 
trial version of the is now available. 


— The development of four prototype tools to help 
users understand the quality of particular statistics. 
The four prototype tools are “Quality Issue 
Summaries”, “Quality Measures”, “Data Accuracy” 
and “Integrated Access to Data and Metadata”. 


More details are available in Allen (2001). 


4.6 Managing the Transfer of Knowledge and Skills 


As in several other national statistical organizations, 
many of the ABS management team, and other senior staff, 
are in their 50s. Some have retired in recent years; 
others are expected to do so over the next few years. If 
managed correctly, this is a great opportunity to refresh the 
organization by bringing new blood into management 
positions. These will normally be younger staff who will 
bring new ideas and energy into the management team. 

On the other hand, experience and know-how will be 
lost. Both sides of this equation need to be managed 
carefully. Our strategy is as follows. 


— We have developed special programs for those staff 
with potential. Specifically, they undertake a 
leadership and management development program 
which has been specially customized for the ABS. 
Staff are chosen for these programs by senior 
managers. You cannot select yourself to be a 
participant in the program. Furthermore, after staff 
have completed the program they can be expected 
to be chosen for a special assignment or rotated to 
a new position. The underlying philosophy is that 
the best way of learning is to obtain a variety of 
work experiences. A very high proportion of those recently 
promoted to senior management positions have 
been participants in these programs. So far this has 
helped us to cover adequately the gaps created by a 
larger number of retirements than in the past. 


— We retain links with retired ABS staff through a 
variety of informal and formal means (e.g., social 
functions, including them on the distribution list for 
ABS News, etc.). Their knowledge is accessible if 
required. 


— We have placed a stronger emphasis on knowledge 
management; using the facilities of our groupware 
product (Lotus Notes) means that key parts of our 
work are well documented and easily accessible. 


— We have made substantial moves to standardize 
methods and systems, meaning there is less 
dependence on local knowledge. 


— For some key positions (e.g., Director of National 
Accounts) we ensure shadowing of work prior to 
the retirement of the incumbent. 


To date we have managed this transition well. We have 
been able to adequately fill vacant senior positions and at 
the same time refresh the organization by promoting staff 
with fresh ideas. There is a need to remain adroit. 


4.7 Use of International Standards 


Our starting position is that where international standards 
exist we should use them. This has not always been the 
case. For example, although our industrial classification has 
been loosely based on ISIC, and a concordance developed 
with ISIC, the classification is largely homegrown 
reflecting the specific interests of Australia and New 
Zealand. We have agreed to use the 2007 version of ISIC, 
at least for the upper two levels, with variations at lower 
levels only where there are specific circumstances that 
justify it. 

There are often pressures on us to depart from 
international standards. Sometimes this is to make the 
Australian situation look better. In other cases, such as with 
the ILO unemployment definition, the pressure is because 
the international definition does not seem to reflect the real 
situation in Australian circumstances. We resist these 
pressures but it is important that we have a well docu- 
mented international standard as a reference point to justify 
our position. Nevertheless, where departures from the 
international standard are made on an exception basis, they 
need to be well documented with a clear explanation of the 
reason. In cases where there is a need to have information 
on a basis other than the international standard, our position 
is that we should publish statistics on both bases. The 
headline figure would still reflect the international standard, as 
increasingly the Australian situation is being compared with 
that of other countries and it is important that this is done on 
a comparable basis. For example, this approach is being 
taken to satisfy the demand for underemployment data and 
to reduce criticisms of the ILO unemployment definition. 

There is a tension that needs to be managed, but if we are 
serious about the importance of international comparisons 
it is imperative that the international standard is the main 
guiding light in developing the concepts, sources and 
methods used in Australia. For these reasons we regard it as 
a priority to make a significant contribution to the develop- 
ment and revision of international standards. 


5. CONCLUSION 


We would all agree that attention to quality is a 
fundamental aspect of our operation. In this paper, we have 
attempted to show that there are many dimensions to 
quality. This same message is clear from the frameworks 
for quality that have been developed by other organizations, 
such as the IMF, Statistics Canada and Statistics Sweden. 
The consequence is that a quality organization depends on 
the actions of all its staff as all can have an impact on 
quality in one way or another. It cannot be left to a work 
group with designated responsibility for quality. Therefore, 
quality can only happen if there is a genuine quality culture 
within the organization. The paper attempts to describe how 
we achieve this within the ABS. Nevertheless, it is 
important to have someone who performs the role of the 
corporate conscience on quality. We have given this respon- 
sibility to the Methodology Division and made the Chief 
part of the ABS Executive team so that it is easier for key 
messages to be conveyed to the senior managers. Among 
other things they draw attention to the most important risks 
to quality or behaviours they see as contrary to our 
corporate objectives. 


Model Explicit Item Imputation for Demographic Categories 


YVES THIBAUDEAU¹ 


ABSTRACT 


We propose an item imputation method for categorical data based on an MLE derived from a conditional probability model 
(Besag 1974). We also define a measure for the item non-response error that is useful to evaluate the bias relative to other 
imputation methods. To compute this measure, we use Bayesian iterative proportional fitting (Gelman and Rubin 1991; 
Schafer 1997). We implement our imputation method for the 1998 dress rehearsal of Census 2000 in Sacramento, and we 
use the error measure to compare item imputations between our method and a version of the nearest neighbor hot-deck (Fay 
1999; Chen and Shao 1997, 2000) at aggregate levels. Our results suggest that our method gives additional protection 
against imputation biases caused by heterogeneities between domains of study, relative to the hot-deck. 


KEY WORDS: Nearest Neighbor; Conditional probability approach; Bayesian iterative proportional fitting. 


1. INTRODUCTION AND BACKGROUND 


Let S represent a demographic categorical count 
requested from a census, or needed to compute a survey 
statistic, and suppose S can be computed from the records 
of a survey file f, when the records are complete. Also, 
suppose f is ordered in such a way that proximity in the 
order of f corresponds to geographical proximity. Consider 
the situation where f includes records with unreported 
items. We propose to estimate S with d(A(f)), where
A(f) is an imputation method that produces a complete
survey file, and d(·) estimates S by replacing the
unreported items with their values imputed with A(f). A(f)
is based on a likelihood that models transitions between two 
neighbors in f, and associations between the items to be 
imputed and the relevant domains of study (Cochran 1977, 
page 34) defined by partitions of the population. A(f) is
meant as an advantageous alternative to the popular 
sequential hot-deck (Kovar and Whitridge 1995), which is 
a version of the nearest neighbor hot-deck (Fay 1999; Chen 
and Shao 1997, 2000) that attempts to minimize geogra- 
phical distance between a unit with unreported items and a 
suitable imputation donor, while also guaranteeing the 
distributional homogeneity of the observed and the imputed 
items with respect to each domain of study. When the 
domains of the same partition tend not to overlap
geographically, borrowing imputation items from a nearby
neighbor preserves homogeneity. But when small domains
tend to be dispersed within large domains, the methodo- 
logist faces a dilemma. Then, she must choose between 
hot-deck rules that lead to borrowing the imputed items 
from geographically close units, leaving the possibility of 
imputation biases reflecting the local heterogeneity between 
domains, and domain-specific rules, which guarantee distri- 
butional homogeneity by domain, but may not minimize 
geographical distance. A(f) is an alternative designed to
preserve domain integrity, while also simulating the
distributional profile of an imputation donor sharing some
characteristics with a geographical neighbor. We motivate 
the design of A(f) with examples and a theoretical 
description. In this section we review a classification of 
current hot-deck methods for item imputation with their 
operating principles, so that we can properly compare them 
with A(f) in later sections. We also give details on the
dress rehearsal of Census 2000 in Sacramento, our test bed 
throughout the paper. 

Fay (1999) and Sande (1981) identify the sequential
hot-deck (SHD) as the first category of hot-decks, which we 
call the “pure” SHD. They add a second category, the 
fixed-cell hot-deck (FCHD), which we call the pure FCHD. 
Fay defines a third category of hot-decks: the nearest 
neighbor hot-deck (NNHD). Chen and Shao (1997, 2000) 
give an abstract definition of the NNHD in terms of a 
measure of proximity | · |, based on a covariate x. With the
NNHD, a “donor” is any unit such that |x_r − x_d| is
minimal, where x_r corresponds to the receiving unit (receiver),
and x_d corresponds to the provider of the imputations
(donor). By constructing the appropriate measure, and
defining a suitable x, we recover both the pure SHD and the 
pure FCHD as special cases of the NNHD. The pure SHD 
imputes a receiver item by replacing it with the corre- 
sponding item from the closest unit for which it was 
reported, in the order of f. The pure FCHD relies only on 
the value of variables that we call the class variables to 
divide the units between post-strata that are homogeneous
with respect to the items to be imputed. A donor is chosen 
at random from the same post-stratum as that of the 
receiver, irrespective of the order of f.

Fay (1999) and Fay and Town (1998) propose the
concept of exchangeability to validate the NNHD. For
categorical data, two units in f are exchangeable if they are
uncorrelated and identically distributed, given the
information available prior to imputing. The operational
assumption of the NNHD is that a unit and its nearest 
neighbor(s) are exchangeable. For the pure SHD it means 
two contiguous units in f are exchangeable. For the pure 
FCHD it means that units sharing the same values for their 
class variables anywhere in f are exchangeable. We define 
a third instance of the NNHD, which we call the hybrid 
sequential hot deck (HSHD). To guarantee exchangeability 
the HSHD requires proximity both in terms of the order of 
f, and in terms of the class variables. 

We use the term “nearest neighbor” in the abstract sense
of the NNHD, unless specified otherwise. We use the terms 
“closest neighbor” to designate the nearest neighbor of the 
pure SHD, and “closest complete neighbor” to mean the 
survey unit with no unreported items that is closest in the 
order of f. In the case of the Sacramento dress rehearsal, the 
Census Bureau uses a HSHD to estimate householder counts 
by tenure, race, origin (Hispanic origin), and sex. The house- 
holder, usually an adult, is unique for each housing unit, and 
is determined by the ages, relationships, and order of the 
persons on the census questionnaire. The HSHD substitutes 
unreported items with the values of these items corres- 
ponding to the last householder who reported them and is in 
the same post-stratum (Treat 1994). The sorted order of f 
maintains the proximity of geographical neighbors. The 
intent behind the HSHD is to define nearest neighbors who 
are close, both in geography and “in kind”. Throughout the 
paper, we continue to use the term householder, although its 
meaning may extend to a generic survey unit. 
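
The HSHD rule described above can be sketched in a few lines. The following is only an illustration under stated assumptions: the record layout, the `post_stratum` key, and the function name `hybrid_sequential_hot_deck` are hypothetical and are not taken from the Census Bureau's production specification. The sketch walks the file in its sorted order and fills an unreported item with the value last reported by a householder in the same post-stratum.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def hybrid_sequential_hot_deck(records: List[Record], item: str,
                               class_var: str) -> List[Record]:
    """Impute `item` from the last reporting unit in the same post-stratum.

    `records` must already be sorted so that neighbors in the list are
    geographical neighbors. Missing values are represented by None.
    """
    last_reported: Dict[Optional[str], Optional[str]] = {}  # post-stratum -> donor value
    completed = []
    for rec in records:
        out = dict(rec)                                  # do not modify the input
        stratum = rec[class_var]
        if rec[item] is not None:
            last_reported[stratum] = rec[item]           # unit becomes the current donor
        elif stratum in last_reported:
            out[item] = last_reported[stratum]           # borrow from the last donor
        # otherwise no donor has been seen yet in this post-stratum,
        # and the item is left missing
        completed.append(out)
    return completed


# Hypothetical four-record file, already in geographical order.
file_f = [
    {"post_stratum": "A", "tenure": "owner"},
    {"post_stratum": "B", "tenure": "renter"},
    {"post_stratum": "A", "tenure": None},
    {"post_stratum": "B", "tenure": None},
]
imputed = hybrid_sequential_hot_deck(file_f, "tenure", "post_stratum")
print([r["tenure"] for r in imputed])
```

With a single post-stratum the same function reduces to a one-directional version of the pure SHD, which is one way to see the HSHD as a compromise between proximity in the order of f and proximity in the class variables.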

The design of the HSHD is well suited for item impu- 
tation in populations geographically clustered by domain. 
Then the need for class variables is limited. But difficulties 
arise when the geographical boundaries between the 
domains begin to blur. Designing a HSHD with good 
discrimination power in those conditions is an attempt at 
walking a fine line between specifying enough class 
variables to account for heterogeneities between domains, 
and specifying too many, which could yield post-strata so 
narrowly defined in terms of domain that they don’t capture 
the local geographical character of the receivers. Compli- 
cating the situation is the fact that the demographic compo- 
sition of the population may change as the geography 
changes, and thus a particular scheme for the HSHD might 
need to be revised, as the geography changes. In the face of 
these difficulties A(f) is innovative in the sense that, 
instead of searching for an ideal nearest neighbor, it gene- 
rates imputations through a model-based simulation that 
integrates information relating to the local geography, as 
well as to domain partitions. A(f) integrates both kinds of
information by calibrating the parameters of a log-linear
model on the basis of the strength of the correlations 
between the covariates and the variables subject to impu- 
tation. Our parameter estimation strategy is the same as that 
of Zanutto and Zaslavsky (1995a, b). However, because
they have access to a representative sample of complete
non-respondents, these authors can obtain estimates of the
imputation probabilities by implementing a one-step EM
algorithm (Dempster, Laird and Rubin 1977). In our 
situation, we don’t assume access to a representative 
sample, and we implement the full EM algorithm. Impli- 
citly we make an assumption of items “missing at random” 
(MAR) (Little and Rubin 1987, page 16). 

To analyze the results obtained with A(f), and to 
compare them with those of the HSHD, we derive error 
measures related to A(f) based on approximations computed
using a Bayesian algorithm first introduced by
Gelman and Rubin (1991). There are fundamental objec- 
tions to Bayesian methodologies. Fay (1992) shows that 
variance estimation based on multiple imputations (Rubin 
1996) can lead to inflated estimates of variance, whereas in 
the same situation the jackknife estimator (Rao and Shao 
1992) avoids biases. Meng (1994) suggests that Fay’s 
example stems from a poor communication between an 
imputer who has specific model information, and an analyst 
who only has knowledge of the estimation process. In the 
language of Meng, this situation is uncongenial. While 
requirements for coordination between imputer and analyst 
are restrictive, imputation based on exchangeability also has 
dangerous pitfalls, as we show in section 2. In addition the 
Bayesian approach allows for asymptotic approximations of 
error measures through mechanical algorithms, while a 
strict frequentist approach might require tedious 
expansions, as we show in section 5. 

Our objective is to present A(f), and to show its
comparative advantages over the HSHD, using the Sacramento
dress rehearsal as an example. In this case f contains records 
for the 138,271 physically enumerated householders 
(Kostanich 1999), of whom 90,156 returned a census 
questionnaire by mail or were visited by an enumerator at 
a first attempt, and 48,115 were selected in a sample. We 
implement our method at the level of the tract, a connected 
unit of geography containing on average 1,300 house- 
holders in f. 

The paper is organized as follows. In section 2 we 
illustrate the difficulties of designing a HSHD methodology 
that guarantees exchangeability. In section 3, we define 
A(f), and in section 4 we present a likelihood for the 
model parameters. In section 5, we show how to implement 
A(f) and derive a measure of error to make comparisons
with the HSHD. Section 6 presents and motivates the basic 
model for the dress rehearsal, and section 7 gives results for 
both A(f) and the HSHD in this case. In section 8, we 
summarize the differences and we make recommendations. 


2. ASSESSING EXCHANGEABILITY WITH 
RESPECT TO A PARTITION BY 
DOMAINS OF STUDY 


We illustrate the difficulties inherent in designing a 
HSHD that preserves exchangeability between domains of 
study (Cochran 1977, page 34) with an example, where 
tenure (ownership) is the measurement of interest, and the
relevant domains of study are defined by race. To impute
tenure, the Census Bureau uses the class variable
“household type” to post-stratify f into five post-strata
defined by the
presence/absence of a live-in spouse for the householder, 
and the size of the household (1, 2, 3+) (Wilson 1998). The 
intent is to define post-strata that establish distributional 
homogeneity in terms of ownership at the level of the 
post-stratum, rendering the domain boundaries of a relevant 
partition uninformative within each post-stratum. 

We examine the post-stratum comprising all the house- 
holders without a live-in spouse, and living in households 
of 3 or more. We call it post-stratum 3. For the purpose of 
this example, we have removed from f all the householders 
with unreported tenure, and each nearest neighbor is exclu- 
sive to a single householder. Table 1 gives householder 
frequencies for eight exhaustive race-tenure categories for 
post-stratum 3. Table 1 also gives the rate of ownership for 
their nearest neighbors, cross-classified by their race and by 
the same eight race-tenure categories of the corresponding 
householders. We observe that, on average, when a 
householder is either in the Black-owner or in the Black- 
renter category, his nearest neighbor is at least 25% more 
likely to be an owner when this nearest neighbor is White, 
than when he is Black. It is tempting to explain this differ- 
ential rate by geographical differences. However, table 2, 
which shows the rates of ownership of the householders in 
post-stratum 3, cross-classified by their own race and that 
of their nearest neighbors, reveals that in fact Blacks with 
White nearest neighbors have a slightly lower rate of 
ownership than Blacks with Black nearest neighbors. What 
this means is that, if the probability of not reporting tenure 
is constant for all Blacks, then imputing their tenure by
substituting the tenure of their nearest neighbor
overestimates ownership for Blacks in post-stratum 3.

These distributional disparities between householders 
and their nearest neighbors reflect a lack of exchange- 
ability. A McNemar test leads to a formal rejection of the 
exchangeability hypothesis. There are 1,784 Black house- 
holders with White nearest neighbors. In 1,187 instances, 
tenure is tied. Among the 597 non-tied cases, the owner is 
White in 396 cases. Under the exchangeability hypothesis, 
ownership goes to either race with probability one-half. 
But the proportion of Whites among the owners is eight 
standard deviations above one-half. This example illustrates 
the difficulties in designing a valid NNHD that maintains 
exchangeability. In the next section we present our impu- 
tation method, which is devised for this type of situation. 
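
The figure of eight standard deviations can be checked from the counts quoted above. The short sketch below assumes the usual normal approximation to the McNemar (sign) statistic on the non-tied pairs; the variable names are ours.

```python
import math

black_white_pairs = 1784                  # Black householders with White nearest neighbors
tied = 1187                               # pairs in which both units have the same tenure
discordant = black_white_pairs - tied     # non-tied pairs
white_owner = 396                         # non-tied pairs in which the White unit owns

p_hat = white_owner / discordant
sd_under_h0 = math.sqrt(0.25 / discordant)   # SD of a proportion when p = 1/2
z = (p_hat - 0.5) / sd_under_h0
print(discordant, round(p_hat, 3), round(z, 2))
```

The resulting z statistic is just under 8, in agreement with the text.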


3. AN IMPUTATION METHOD BASED ON 
DEMOGRAPHIC TRANSITION PROBABILITIES 


Besag (1974) describes the conditional probability 
approach to spatial processes. This approach gives a frame- 
work for probabilistically modeling the values of “sites”, in 
terms of the values of their “neighbours” to construct a 
spatial process. Besag (1974) also suggests making a 
unilateral approximation to simplify this construction. 
Then, the value of each site depends only on a finite 
number of “predecessors”. This approach is natural in our 
situation since f provides a unilateral ordering of house- 
holders who play the roles of sites and predecessors, in turn. 
Specifically, we construct a first-order process where each
householder is a site, and the closest complete neighbor is
its only predecessor. In this set-up, the value of a site is the
state of a householder, which we define shortly. We refer
to the conditional probability for the value of a site given
that of its predecessor as the transition probability from the
state of the closest complete neighbor to the state of the
householder. Our imputation methodology is based on the
MLE of the transition probabilities at the level of a tract. In
this section we describe the imputation methodology, and
in the next section we introduce a likelihood for the
transition probabilities.


Table 1
Number of Householders and Rates of Ownership of the Nearest Neighbors in Post-Stratum 3 by Race of the Nearest Neighbor
and Joint Race and Tenure of the Householder

                                                  Race-Tenure Category of the Householder
                                                  White        White        Black        Black        Asian        Asian        Other        Other
                                                  Owner        Renter       Owner        Renter       Owner        Renter       Owner        Renter
Number of Householders in Post-Stratum 3          3,347        5,197        1,319        3,630        872          1,196        681          1,637
Rate of Ownership of the White Nearest Neighbors  0.564        0.5647       0.562        0.299        0.561        0.287        0.540        0.163
Rate of Ownership of the Black Nearest Neighbors  0.379        0.189        0.427        0.211        0.443        0.202        0.471        0.158
Rate of Ownership of the Asian Nearest Neighbors  0.589        0.332        0.667        0.320        0.668        0.262        0.9355       0.302
Rate of Ownership of the Other Nearest Neighbors  0.423        0.251        0.497        [illegible]  0.595        0.177        0.463        0.152


Table 2
Rates of Ownership of the Householders in Post-Stratum 3 by
Race of the Householder and Race of the Nearest Neighbor

                                              Race of the Nearest Neighbor
                                              White        Black        Asian        Other
Rate of Ownership of the White Householders   [illegible]  [illegible]  0.554        [illegible]
Rate of Ownership of the Black Householders   0.257        0.264        0.304        [illegible]
Rate of Ownership of the Asian Householders   0.441        0.441        0.400        0.360
Rate of Ownership of the Other Householders   0.309        0.297        [illegible]  [illegible]

Consider a population of householders in f representing
a tract. Let Ψ represent a set of C categorical variables that
characterize each householder. The variables are labeled
1, ..., C, and have respectively K_1, ..., K_C categories. Let
Ψ* denote the Cartesian product of the categorical
variables in Ψ. Then, Ψ* is the state space of the
householder and has K states, where K = ∏_{i=1}^{C} K_i.
Similarly, let Ξ be the set of E categorical variables defining
the closest complete neighbor in f. The variables are labeled
1, ..., E, and have F_1, ..., F_E categories. Ξ* is the state
space of the closest complete neighbor and has F states, where
F = ∏_{j=1}^{E} F_j. The items represented in Ξ are also
represented in Ψ. Let the state of the householder be s ∈ Ψ*,
where s is a vector whose components represent the
variables in Ψ. Similarly, t ∈ Ξ* is the state of the closest
complete neighbor. Under the assumptions above, let
P(s|t) represent the transition probability from t to s in
the order of f. Now suppose a householder only reported the
categorical variables in a subset Z ⊂ Ψ. Let v ∈ Z* be the
vector of reported variables. Let σ(Ψ, Z, v) ⊂ Ψ* be the
subset containing all the values of s, such that s agrees with v
on the variables in Z. Define


\[
P(s \mid t, Z, v) = \frac{P(s \mid t)}{\sum_{u \in \sigma(\Psi, Z, v)} P(u \mid t)}, \qquad s \in \sigma(\Psi, Z, v). \tag{1}
\]


To impute the items in the set difference Ψ − Z
according to A(f), we roll dice weighted by the values of
the MLE of P(s|t, Z, v), for each householder in marginal
state v and with closest complete neighbor in state t. Under
our assumptions, the MLE of P(s|t, Z, v) contains all the
information available from f on the unreported items. In the
next section we formulate a likelihood for P(s|t, Z, v).
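
The weighted dice roll can be sketched as follows. This is a toy illustration, not the implementation used for the dress rehearsal: the two-item state space, the probabilities in `p_given_t`, and the helper names `conditional_probs` and `impute` are all made up for the example. The first function renormalizes P(s|t) over the states that agree with the reported items, as in (1), and the second draws one complete state.

```python
import random
from itertools import product

# Hypothetical toy state space with two householder items.
VARIABLES = {"race": ["Black", "non-Black"], "tenure": ["owner", "renter"]}
NAMES = list(VARIABLES)
STATES = list(product(*VARIABLES.values()))        # all states s


def conditional_probs(p_given_t, reported):
    """Return P(s | t, Z, v) as in (1) for one fixed neighbor state t.

    p_given_t: dict mapping each state s to P(s | t).
    reported:  dict of reported items v (the variables in Z).
    """
    consistent = [s for s in STATES
                  if all(s[NAMES.index(k)] == val for k, val in reported.items())]
    total = sum(p_given_t[s] for s in consistent)
    return {s: p_given_t[s] / total for s in consistent}


def impute(p_given_t, reported, rng):
    """Roll the weighted dice once and return a complete state."""
    probs = conditional_probs(p_given_t, reported)
    states = list(probs)
    return rng.choices(states, weights=[probs[s] for s in states], k=1)[0]


# Made-up transition probabilities P(s | t) for one neighbor state t.
p_given_t = {("Black", "owner"): 0.10, ("Black", "renter"): 0.30,
             ("non-Black", "owner"): 0.35, ("non-Black", "renter"): 0.25}

probs = conditional_probs(p_given_t, {"race": "Black"})
print(probs)
print(impute(p_given_t, {"race": "Black"}, random.Random(0)))
```

In the method itself the weights would be the MLE of the transition probabilities for the tract, and the neighbor state t would be that of the closest complete neighbor of the householder being imputed.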


4. A LIKELIHOOD FOR THE TRANSITION 
PROBABILITIES 


Let N(t, Z, v) be the number of householders who only
reported the items defining the marginal state v involving
only the items in Z ⊂ Ψ, and with closest complete
neighbor in state t. Let N be a vector with the N(t, Z, v)’s as
its components, at the level of a tract. Let P = [P(s|t)] be the
vector comprising the P(s|t)’s ordered lexicographically
by t and s. Based on the assumptions described above, we
have the following likelihood for the transition
probabilities.


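\[
L(N; P) = \prod_{t \in \Xi^*} \prod_{Z \subset \Psi} \prod_{v \in Z^*} \Bigl[ \sum_{s \in \sigma(\Psi, Z, v)} P(s \mid t) \Bigr]^{N(t, Z, v)}, \qquad P \in \Theta_P. \tag{2}
\]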


The running indices in (2) are t, Z, v, and s. If every item
is reported, then Ψ is the only instance of Z with
N(t, Z, v) ≠ 0, for some t and v. In that case (2) is
analogous to the likelihood of the transition probabilities of a
first-order Markov chain (Bishop, Fienberg and Holland
1975, page 263). In general, we model Θ_P as a log-linear
subspace. For this purpose it is more convenient to work
with an expression equivalent to (2) that has a simpler
algebraic representation. We introduce the nuisance
parameter U = [U(t)], where U is a probability vector, that
is, Σ_{t ∈ Ξ*} U(t) = 1, and 0 < U(t) < 1, for all t ∈ Ξ*. U
represents the prevalences of the states of the closest
complete neighbors. Let Q(s, t) = U(t) × P(s|t), and
Q = [Q(s, t)]. Then Q is a probability vector with K × F
components lexicographically ordered by t and s. We set
up Θ, the parameter space of Q, as a hierarchical log-linear
model (Agresti 1990, page 143; Bishop, Fienberg and
Holland 1975, page 67). Then, if we design Θ so that it
includes the interactions of all orders between the variables
in Ξ, (2) is equivalent to the following likelihood in terms
of Q.
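\[
L^*(N; Q) = \prod_{t \in \Xi^*} \prod_{Z \subset \Psi} \prod_{v \in Z^*} \Bigl[ \sum_{s \in \sigma(\Psi, Z, v)} Q(s, t) \Bigr]^{N(t, Z, v)}, \qquad Q \in \Theta. \tag{3}
\]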
That is, if Θ has the architecture described above, a
specific choice for Θ unambiguously defines Θ_P in (2),
and since the items of the closest complete neighbor are
always reported, the factorization L(N; P) =
L*(N; Q) × R(N; U) holds, for some R(·;·). (3) is easier
to manipulate than (2) since it corresponds to the likelihood 
of the cell probabilities associated with a partially classified 
contingency table (Little and Rubin 1987, page 181). Under 
mild conditions on the non-response mechanism (for 
example, strictly positive and constant probabilities for each 
response configuration (Thibaudeau 1988)) the likelihoods 
in (2) and (3) are identifiable and asymptotically unimodal. 
Multimodality is theoretically possible for finite samples, 
but it does not appear to occur in the cases studied in the 
paper, where the proportions of unreported items are small. 


5. FINDING THE MLE AND DERIVING 
MEASURES FOR THE NON-RESPONSE ERROR 


In this section, we recall how to compute P̂, the MLE of
P, and we derive measures of errors for A(f) and another
predictor Ŝ(s), which we term the “MLE” of the expected
value of S(s), which is the actual count of householders in
state s at the tract level. An error measure for Ŝ(s) will be
useful in section 7 to evaluate the imputation results
obtained with A(f) relative to those with the HSHD. We
compute P̂ by maximizing (3), in terms of Q, with the EM
algorithm. Because of the factorization described in section
4, this maximum also yields P̂.
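
To make the EM step concrete, the sketch below fits the cell probabilities of a small two-way table in which some units are classified on only one of the two items, which is the partially classified contingency table setting referred to in section 4. It is a simplification: the model is saturated, whereas the method described here constrains Q to a log-linear model, so the M-step below would be replaced by a constrained fit (for example, iterative proportional fitting). The counts and function names are invented for the illustration.

```python
import math


def em_partially_classified(full, row_only, col_only, n_iter=200):
    """EM for the cell probabilities of an R x C table.

    full[i][j]  : counts with both items reported
    row_only[i] : counts with only the row item reported
    col_only[j] : counts with only the column item reported
    """
    R, C = len(full), len(full[0])
    n = sum(map(sum, full)) + sum(row_only) + sum(col_only)
    q = [[1.0 / (R * C)] * C for _ in range(R)]              # starting value
    for _ in range(n_iter):
        row_tot = [sum(q[i]) for i in range(R)]
        col_tot = [sum(q[i][j] for i in range(R)) for j in range(C)]
        # E-step: spread each partially classified count over compatible cells.
        filled = [[full[i][j]
                   + row_only[i] * q[i][j] / row_tot[i]
                   + col_only[j] * q[i][j] / col_tot[j]
                   for j in range(C)] for i in range(R)]
        # M-step: saturated multinomial MLE from the completed table.
        q = [[filled[i][j] / n for j in range(C)] for i in range(R)]
    return q


def log_likelihood(q, full, row_only, col_only):
    """Observed-data log-likelihood, the analogue of (3) for this toy table."""
    R, C = len(full), len(full[0])
    ll = sum(full[i][j] * math.log(q[i][j]) for i in range(R) for j in range(C))
    ll += sum(row_only[i] * math.log(sum(q[i])) for i in range(R))
    ll += sum(col_only[j] * math.log(sum(q[i][j] for i in range(R)))
              for j in range(C))
    return ll


# Invented counts for a 2 x 2 table.
full = [[100, 50], [75, 75]]
row_only = [30, 60]
col_only = [28, 60]
q_hat = em_partially_classified(full, row_only, col_only)
print([[round(x, 4) for x in row] for row in q_hat])
```

Each iteration cannot decrease the observed-data log-likelihood, which gives a simple check on an implementation.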

To derive measures of error in predicting S(s) for a
given s, consider all the triples of the form (t, Z, v) in (1)
that are observed in the sample (i.e., the tract) for which it is
possible, but due to item non-response it is not known, that
one or more householders corresponding to such a triple are
in state s. Let Λ(s) be the number of such triples. We index
these triples with λ = 1, ..., Λ(s). Let δ(λ) be the number
of householders corresponding to triple λ, and let p_λ(s) be
the probability that such a householder is indeed in state s,
where p_λ(s) is derived from P. Let Δ(s, λ) be the unknown
number of householders who are indeed in state s among
the δ(λ) candidates. Based on our model we have S(s) =
S_obs(s) + Σ_{λ=1}^{Λ(s)} Δ(s, λ), where S_obs(s) is the number of
householders who reported being in state s and Δ(s, λ) is
Bin(δ(λ), p_λ(s)). Furthermore, let Ŝ(s) = S_obs(s) +
Σ_{λ=1}^{Λ(s)} δ(λ) p̂_λ(s), where p̂_λ(s) is the MLE of p_λ(s). If we
treat the Δ’s as independent predictors, like in a regression
situation, and since P̂ is asymptotically normal with mean
P, we have the following large sample approximation for
the MSE of Ŝ(s) in predicting S(s).


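\[
E\Bigl[ \Bigl( \sum_{\lambda=1}^{\Lambda(s)} \bigl( \delta(\lambda)\,\hat{p}_\lambda(s) - \Delta(s, \lambda) \bigr) \Bigr)^2 \Bigm| P \Bigr]
\approx V\Bigl[ \sum_{\lambda=1}^{\Lambda(s)} \delta(\lambda)\,\hat{p}_\lambda(s) \Bigm| P \Bigr]
+ V\Bigl[ \sum_{\lambda=1}^{\Lambda(s)} \Delta(s, \lambda) \Bigm| P \Bigr]. \tag{4}
\]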


Let V_1 and V_2 be the first and second variances on the
RHS of (4). Gelman and Rubin (1991), Larsen (1996), and
Schafer (1997, page 324) introduce data augmentation
Bayesian iterative proportional fitting (DABIPF) to
simulate posterior and predictive distributions associated with
log-linear models with data missing at random. We can use
DABIPF to approximate model-consistent estimators for
V_1 and V_1 + V_2 through simulations of the posterior
distribution of S_obs(s) + Σ_{λ=1}^{Λ(s)} δ(λ) p_λ(s) and the
predictive distribution of S(s), respectively. Furthermore, we
approximate the MSE of the demographic counts obtained
imputing with A(f) by adding another V_2 to V_1 + V_2 in (4) to
account for the additional noise of the “dice roll” involved
in A(f).


6. MODELING AND SENSITIVITY ANALYSIS 


6.1 A Conditional Independence Model for 
Sacramento 


Using the notation of section 3, the householder
variables in Ψ are race, origin, tenure, and sex. The
categories for race are White, Black, Asian, and Other. For
origin they are Hispanic and non-Hispanic. For tenure they
are owner and renter. For sex they are male and female.
The neighbor variables in Ξ are race, origin, and tenure.
The categories for race of the neighbor are Black and
non-Black. The categories for origin and tenure of the
neighbor are the same as for the householder. We design Θ
in (3) by selecting interactions between the variables in Ψ
and Ξ. To ensure equivalence between (2) and (3), we
select the interactions of all orders between the variables in
Ξ. We attempt to maintain through the imputations the
correlation between successive householders in f in terms of
each item in Ξ. Thus we include each interaction
associating an item in Ξ to the corresponding item in Ψ. We
complete the model by selecting consistency associations:
we include the six interactions representing the
associations involving a pair of items in Ψ. The resulting
contingency table has 256 cells, and the log-linear model has
thirty free parameters.
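
The cell and parameter counts can be verified by enumerating the terms of the model. The bookkeeping below is our reading of the description above, and it assumes that main effects for all seven variables are included, as they must be in a hierarchical model; each term contributes the product of (number of categories − 1) over its variables.

```python
from itertools import combinations

householder = {"race": 4, "origin": 2, "tenure": 2, "sex": 2}   # variables in Psi
neighbor = {"n_race": 2, "n_origin": 2, "n_tenure": 2}          # variables in Xi


def n_params(levels):
    """Free parameters of one term: product of (levels - 1)."""
    out = 1
    for k in levels:
        out *= k - 1
    return out


cells = 1
for k in list(householder.values()) + list(neighbor.values()):
    cells *= k

terms = []
# Neighbor main effects and interactions of all orders among neighbor items.
for r in range(1, len(neighbor) + 1):
    terms += [tuple(neighbor[v] for v in combo)
              for combo in combinations(neighbor, r)]
# Householder main effects and the six pairwise householder associations.
terms += [(k,) for k in householder.values()]
terms += [(householder[a], householder[b])
          for a, b in combinations(householder, 2)]
# Each neighbor item linked to the corresponding householder item.
terms += [(neighbor["n_" + v], householder[v])
          for v in ("race", "origin", "tenure")]

free_parameters = sum(n_params(t) for t in terms)
print(cells, free_parameters)   # prints: 256 30
```

The count splits into 7 parameters for the neighbor terms, 6 householder main effects, 12 for the six householder pairs, and 5 for the neighbor-to-householder links.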

This model leads to a conditional independence tran- 
sition structure. For example, conditional on the race of the 
closest complete neighbor, the race of the householder is 
independent of the tenure of the closest complete neighbor. 
Conditional independence allows us to combine neighbor 
information obtained from multiple neighbors to produce a 
synthetic closest complete neighbor. This approach ensures 
that we can use all the information available from the 
closest neighbor, even if he is not complete. With this 
approach, the correlation structure among the items of the 
householder is maintained whenever only one item per 
householder is imputed. In Sacramento, among 138,271 
householders, approximately 0.1% did not report sex, 3.5% 
did not report race, 2.9% did not report origin, and 7.6% did 
not report tenure. Furthermore, race and origin are missing 
jointly for 0.49% of the householders, race and tenure 
0.48%, origin and tenure 0.69%. Given these low rates of 
jointly missing items, we expect our model to do well. 


6.2 Sensitivity Analysis and Evaluation 


In section 7 we use the standard error of the predictive
distribution of S(s) to approximate √(V_1 + V_2), the error
of Ŝ(s) in predicting S(s), as derived in (4), and we
assume asymptotic normality of Ŝ(s) − S(s). The accuracy
of this approximation depends on the accuracy of the
approximation of the distribution of the MLE P̂ with the
posterior distribution of P. This latter approximation is
accurate asymptotically when the model holds, but we still
need to verify the extent to which this asymptotic result is
applicable when the sample is finite. To do so we examine
the sensitivity of the posterior distribution of P under prior
changes. A low sensitivity implies that the posterior
distribution of P is a good approximation of the distribution of
P̂. We focus on the posterior distribution for the
conditional probability that origin is Hispanic, conditional on
each race. An increase of .1 in the value of α, the prior
parameter of the constrained Dirichlet family (Schafer 1997,
page 346), which is the natural family for (3), is equivalent
to observing three additional Hispanics and three additional
Non-Hispanics of each race. Table 3 gives the posterior
modes and standard deviations (SD) of the posterior density
of the conditional probability that origin is Hispanic given
each race, for four choices of α, for a specific tract X.
Figure 1 shows the posterior of the conditional probability
given race is White. This posterior is stable under prior
disturbances and we expect it to give a good approximation
for the distribution of the corresponding MLE. On the other
hand, Figure 2, which shows the posterior of the conditional
probability given race is Black, displays a high sensitivity,
suggesting that our proposed asymptotic approximation is
less accurate in this case. This is not surprising in light of
the facts that, for Blacks, the MLE of the conditional
probability is close to 0 and the domain (race) size is smaller
(among the 1,583 householders in tract X, there are 1,087
Whites, 179 Blacks, 56 Asians, 172 Others, while 89 did
not report race). In the next section we focus on cases
where the conditional probabilities are not near 0 or 1, and
the size of the domain is large. We retain the choice α = .01
for the prior, which is approximately Jeffreys’ prior on the
marginal conditional probabilities that define the model. It
is beyond the scope of the paper to address the difficulties
when the domain is small and/or the MLE is near 0 or 1.
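
The statement that an increase of .1 in the prior parameter amounts to about three additional observations per race-origin combination can be reproduced with simple arithmetic. The calculation assumes, as our interpretation rather than a statement from the text, that the prior parameter acts like the same number of pseudo-observations added to each of the 256 cells of the table in section 6.1.

```python
cells = 256                    # cells of the contingency table in section 6.1
race_levels, origin_levels = 4, 2

# Number of cells that share one (race, origin) combination of the householder.
cells_per_margin = cells // (race_levels * origin_levels)

delta_alpha = 0.1              # increase in the prior parameter
pseudo_counts = delta_alpha * cells_per_margin
print(cells_per_margin, round(pseudo_counts, 1))
```

Under that assumption each race-origin margin collects 32 cells, so the increase corresponds to roughly 3.2 prior observations of Hispanics and of Non-Hispanics for each race.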


Table 3
MLE, Posterior Mode (approximate), and Standard Deviation for
the Conditional Probabilities of Origin Being Hispanic Given
Race for Four Choices of Prior Distribution

                   α = .01           α = .1            α = .5            α = 1
Race    MLE        Mode    S.D.      Mode    S.D.      Mode    S.D.      Mode    S.D.
White   .1784      .178    .01195    .184    .01247    .180    .01219    .188    .01186
Black   .07428     .0690   .02272    .081    .02330    .120    .02428    .160    .02782
Asian   .09113     .105    .04086    .108    .04550    .195    .04881    .276    .04952
Other   .9662      .966    .01171    .964    .01347    .950    .01495    .930    .01666


[Figure 1 not reproduced. Horizontal axis: Probability (0.15 to 0.25); vertical axis: Freq (0 to 4000).]

Figure 1. Posterior Distribution
Prob. Origin is Hispanic - White Householder


[Figure 2 not reproduced. Horizontal axis: Probability (0.05 to 0.30); vertical axis: Freq (1000 to 4000).]

Figure 2. Posterior Distribution
Prob. Origin is Hispanic - Black Householder


7. RESULTS FOR THE SACRAMENTO DRESS 
REHEARSAL 


Table 4 gives count estimates at the level of Sacramento 
derived with A(f) based on the model of section 6.1 fitted
for each of the 102 tracts, as well as count estimates
obtained with the HSHD. Table 4 also gives error
measurements based on a sequence of 2000 DABIPF iterations
with 2000 burn-in iterations, for each of the 102 tracts in
Sacramento (see appendix A for convergence), serving to
approximate √(V_1 + V_2) derived from (4). We call √(V_1 + V_2)
the prediction error of the MLE. We estimate √V_2
separately by “rolling dice” loaded with the MLE. We call √V_2
the model residual error. We use √(2V_2 + V_1), which we call
the total imputation error, to express the error of A(f) in
estimating the true count. If we assume Ŝ(s) is positively
correlated with the HSHD, the prediction error of the MLE 
can be used as an upper bound for the standard error of the 
distance between the count estimates corresponding to the 
MLE and the HSHD. For the Black owners, this distance is 
severely incompatible with the hypothesis that the MLE and 
the HSHD have the same expectation. This is no surprise in 
light of the results of section 2. 
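
The relation between the three error measures, and the size of the discrepancy for the Black owners, can be illustrated with figures taken from Tables 4 and 5. The function name is ours, and the inputs are the rounded values printed in the tables, so the results agree with the published columns only up to rounding.

```python
import math


def total_imputation_error(model_residual_error, prediction_error):
    """sqrt(2*V2 + V1), given sqrt(V2) and sqrt(V1 + V2)."""
    v2 = model_residual_error ** 2
    v1_plus_v2 = prediction_error ** 2
    return math.sqrt(v1_plus_v2 + v2)


# Black householders: model residual error 14.9, prediction error 16.5.
black_total = total_imputation_error(14.9, 16.5)

# Black owners: HSHD count versus MLE of the expected count, expressed in
# units of the prediction error of the MLE (the upper bound used in the text).
gap_in_se = (7661 - 7542.3) / 20.7

print(round(black_total, 1), round(gap_in_se, 1))
```

The first value is close to the total imputation error of 22.3 reported for Black householders. The second shows the HSHD count for Black owners lying more than five such standard errors above the MLE.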

Interestingly, the results of table 4 can serve to improve 
the performance of the HSHD. Since tenure is unreported 
twice as often as race, our results for the Black owners 
suggest improving the HSHD by including race as a class 
variable for the imputation of tenure with the HSHD. Table 
5 shows results obtained with this re-engineered HSHD, 
and exchangeability of tenure between nearest neighbors 
based on this new post-stratification is more plausible than 
for the original scheme.


Table 4
Population Counts and Uncertainty Measures for Sacramento

                        Imputed      Imputed      MLE of the   Model        Prediction   Total
                        Count With   Count With   Expected     Residual     Error of     Imputation
                        HSHD         Model        Count        Error        the MLE      Error
All                     138,271      138,271      138,271.0    0.0          0.0          0.0
White                   89,032       88,914       88,927.7     [illegible]  [illegible]  47.2
Black                   19,962       19,943       19,952.9     14.9         16.5         22.3
Asian                   17,405       17,421       17,426.2     14.0         14.9         20.5
Other                   11,872       11,993       11,964.1     29.8         [illegible]  44.8
Hispanic                21,024       21,050       21,038.1     10.3         10.6         14.7
Non-Hispanic            117,247      117,221      117,232.8    10.3         10.6         14.7
Owner                   70,054       70,022       70,026.3     42.8         43.3         60.9
Renter                  68,217       68,249       68,244.7     42.8         43.3         60.9
White Hispanic          9,068        8,972        8,991.1      29.9         33.6         45.0
White Non-Hispanic      79,964       79,942       79,936.6     15.4         [illegible]  22.0
Black Hispanic          605          612          608.6        11.0         12.6         16.7
Black Non-Hispanic      19,357       19,331       19,344.3     10.8         10.7         15.2
Asian Hispanic          518          515          516.5        10.0         [illegible]  15.2
Asian Non-Hispanic      16,887       16,906       16,909.7     10.4         10.3         14.6
Other Hispanic          10,833       10,951       10,921.9     29.7         33.5         44.6
Other Non-Hispanic      1,039        1,042        1,042.3      3.5          3.4          4.9
White Owner             47,722       47,767       47,770.5     37.8         41.3         56.0
White Renter            41,310       41,147       41,157.3     39.0         41.4         56.9
Black Owner             7,661        7,538        7,542.3      19.6         20.7         28.5
Black Renter            12,301       12,405       12,410.6     21.1         22.5         30.8
Asian Owner             9,810        9,853        9,872.8      18.4         18.6         26.1
Asian Renter            7,595        7,568        7,553.4      18.2         18.8         26.1
Other Owner             4,861        4,864        4,840.7      24.4         25.2         [illegible]
Other Renter            7,011        7,129        7,123.4      25.4         28.6         38.2
Hispanic Owner          9,409        9,434        9,402.2      19.5         20.9         28.6
Hispanic Renter         11,615       11,616       11,629.9     20.4         21.4         29.4
Non-Hispanic Owner      60,645       60,588       60,618.0     38.9         39.4         55.4
Non-Hispanic Renter     56,602       56,633       56,614.8     [illegible]  39.6         55.4


Table 5
HSHD with Race as an Additional Class Variable

               Imputed      Imputed Count with         Imputed      MLE of the   Prediction
               Count with   HSHD Re-engineered with    Count with   Expected     Error of
               HSHD         Race as a Class Variable   Model        Count        the MLE
White Owner    47,722       47,687                     47,767       47,770.5     41.3
Black Owner    7,661        7,573                      7,538        7,542.3      20.7
Asian Owner    9,810        9,851                      9,853        9,872.8      18.6
Other Owner    4,861        4,840                      4,864        4,840.7      25.2
Owner          70,054       69,951                     70,022       70,026.3     43.3


8. CONCLUSION 


In section 2 we have shown that the HSHD may fail to 
retrieve exchangeable householders, producing a bias 
relative to a situation where exchangeability holds. As more 
evidence that A(f) partly corrects this relative bias, we 
compare the observed and the imputed cross-product ratios 
(Bishop, Fienberg and Holland 1975, page 14) between two 
races (Black, White) and the two tenures. We look at the 
cross product ratio involving: 


1. Only observed householders. 
2. Householders with tenure imputed with the HSHD. 


3. Householders with tenure imputed with A(f). 

There are 73 tracts where all these cross-product ratios 
can be measured. The HSHD produces cross-product 
ratios smaller than those observed for 53 tracts. A(f) 
displays more symmetry as it produces cross-product ratios 
smaller than observed only for 43 tracts. A sign test 
confirms that A(f) (p = .064) is more in sync with the 
observations than the HSHD (p = .0001). 
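
The tract counts above can be turned into sign-test p-values with a few lines of code. The sketch below is only an illustration: the text does not say which form of the sign test was used, so a one-sided normal approximation without continuity correction is assumed here, and the function name `sign_test_upper_tail` is hypothetical.

```python
import math

def sign_test_upper_tail(k, n):
    """One-sided normal-approximation p-value for k or more 'smaller
    than observed' tracts out of n, under a fair-coin null.
    Assumption: no continuity correction is applied."""
    z = (k - n / 2) / math.sqrt(n / 4)
    # Upper tail of the standard normal: 1 - Phi(z)
    return 0.5 * math.erfc(z / math.sqrt(2))

p_hshd = sign_test_upper_tail(53, 73)   # HSHD: 53 of 73 tracts
p_model = sign_test_upper_tail(43, 73)  # A(f): 43 of 73 tracts
print(f"HSHD p = {p_hshd:.5f}, A(f) p = {p_model:.3f}")
```

Under this assumed form of the test, the value for A(f) rounds to the .064 quoted in the text, and the HSHD value is of the order of .0001 or smaller.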

In general, we expect the HSHD to give good count 
estimates when the householders tend to geographically 
coalesce by domain of study. But difficulties arise in a 
situation where domains of study exhibiting substantial 
distributional dissimilarities are geographically integrated. 
In such a situation, implementing the HSHD requires 
accurate parsing of the class variables. Frugality is paramount 
when specifying class variables, but at the same time the 
price to pay for omitting a crucial variable can be 
substantial. Thus the designer of the HSHD has little room for 
error. By contrast, although model misspecification 
certainly remains a danger, the user of A(f) has more 
freedom to posit several domain partitions without 
impeding the ability of A(f) to adjust the imputations for 
the local geographical character, based on information from 
the closest complete neighbor. A(f) will be useful to 
impute categorical measurements when the impact of the 
relevant domain partitions on the measurements is not 
known a priori, and some of the relevant domains may 
define small subpopulations dispersed within the entire 
population. Then, based on policy considerations, A(f) 
can be applied directly, or to help parse the class variables 
of the HSHD, as we did in section 7. 

A referee notes that a comparison with a procedure based 
on an unbiased sample, building on the method of Zanutto 
and Zaslavsky (1995a,b), would be a defining test for 
A(f). This procedure would require collecting information 
from the item non-respondents on a scale sufficiently large 
to ensure bias detection, and we should take advantage of 
any such opportunity to perform a test of this type. Unfor- 
tunately, because of limited resources, samples containing 
this information are seldom collected. Nevertheless, we are 
hopeful that the analysis of the returns from Census 2000 
aided with procedural information can provide new insights 
on the reliability of A(f). 


ACKNOWLEDGEMENTS 


The author is indebted to William Winkler for his 
guidance. The author is grateful to two referees for their 
discernment, to Eric Slud, Don Malec and Joseph Schafer 
for essential discussions, and to Andrew Gelman and Don 
Rubin for providing their unpublished paper. This paper 
reports the results of research and analysis undertaken by 
Census Bureau staff. It has undergone a more limited 
review than official Census Bureau publications. This 
report is released to inform interested parties of research 
and to encourage discussion. 


APPENDIX A - CONVERGENCE OF DABIPF 


We ran two chains of 8,000 iterations each, with 
over-dispersed starting points, for the case a = 0.01, for tract X. 
We computed $\sqrt{\hat{R}}$ (Gelman and Rubin 1992) for Q(s, t) in 
(3), for sequences of 1,000, 2,000, and 4,000 iterations, 
after burn-in lags of 1,000, 2,000, and 4,000 iterations 
respectively. After 2,000 iterations, with 2,000 burn-in 
iterations, we observed that $\sqrt{\hat{R}} < 1.010$ in all studied cases, 
including those in table 3. We think this level of accuracy 
is acceptable for approximating modes and variances. 
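
As an illustration of the diagnostic used above, the following sketch computes a basic potential scale reduction factor from several equal-length chains. It is a simplified version (between-chain and within-chain variances only, without the degrees-of-freedom correction of Gelman and Rubin 1992), and the function name and the simulated chains are assumptions made for the example, not the computation used in the paper.

```python
import math
import random
import statistics

def potential_scale_reduction(chains):
    """Square root of the pooled-to-within variance ratio for
    equal-length chains (simplified Gelman-Rubin diagnostic)."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean(statistics.variance(c) for c in chains)
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)

# Hypothetical post-burn-in draws: two chains of 2,000 iterations each.
rng = random.Random(0)
chains = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(2)]
print(potential_scale_reduction(chains))
```

Values close to 1 suggest that the chains have mixed; chains stuck around different modes give values well above 1.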
A Hierarchical Bayesian Nonignorable Nonresponse Model for 
Multinomial Data from Small Areas 


BALGOBIN NANDRAM, GEUNSHIK HAN and JAI WON CHOI 


ABSTRACT 


The analysis of survey data from different geographical areas, where the data from each area are polychotomous, can be 
easily performed using hierarchical Bayesian models even if there are small cell counts in some of these areas. However, 
there are difficulties when the survey data have missing information in the form of nonresponse especially when the 
characteristics of the respondents differ from the nonrespondents. We use the selection approach for estimation when there 
are nonrespondents because it permits inference for all the parameters. Specifically, we describe a hierarchical Bayesian 
model to analyze multinomial nonignorable nonresponse data from different geographical areas, some of which can be small. 
For the model, we use a Dirichlet prior density for the multinomial probabilities and a beta prior density for the response 
probabilities. This permits a “borrowing of strength” of the data from larger areas to improve the reliability in the estimates 
of the model parameters corresponding to the smaller areas. Because the joint posterior density of all the parameters is 
complex, inference is sampling based and Markov chain Monte Carlo methods are used. We apply our method to provide 
an analysis of body mass index (BMI) data from the third National Health and Nutrition Examination Survey (NHANES 
III). For simplicity, the BMI is categorized into three natural levels, and this is done for each of eight age-race-sex domains 
and thirty-four counties. We assess the performance of our model using the NHANES III data and simulated examples, 
which show our model works reasonably well. 


KEY WORDS: Latent variable; Metropolis-Hastings sampler; Nonignorable nonresponse; Selection approach; Small area. 


1. INTRODUCTION 


The nonresponse rates in many surveys have been 
increasing steadily (De Heer 1999; Groves and Couper 
1998), making the nonresponse problem more important. 
For many surveys the responses are polychotomous. For 
example, in the third National Health and Nutrition 
Examination Survey (NHANES III), we can estimate the 
proportions of persons belonging to three levels of body 
mass index (BMI), although BMI is a continuous variable. 
The purpose of this paper is to describe a new hierarchical 
Bayesian model to study nonignorable multinomial 
nonresponse for small areas, and to apply it to the NHANES III 
BMI data. 

Rubin (1987) and Little and Rubin (1987) describe two 
types of models which differ according to the ignorability 
of response. In the ignorable nonresponse model the 
distribution of the variable of interest for a respondent is the 
same as the distribution of that variable for a nonrespondent 
with the same values of the covariates. In addition, the 
parameters in the distributions of the variable and response 
must be distinct (see Rubin 1976). All other nonresponse 
models are nonignorable. We use both ignorable and 
nonignorable nonresponse models for our data because 
there are no nonrespondents for some domains. 

Crawford, Johnson and Laird (1993) used nonignorable 
nonresponse models to analyze data from the Harvard 
Medical Practice Survey. Stasny, Kadane, and Fritsch 
(1998) used a Bayesian hierarchical model for the 
probabilities of voting guilty or not on a particular trial when the 
views of nonrespondents differ from those of respondents 
in various death-penalty beliefs. Park and Brown (1994) 
used a pseudo-Bayesian method (Baker and Laird 1988), 
and Park (1998) applied a method in which prior observa- 
tions are assigned to both observed and unobserved cells to 
estimate the missing cells of a multi-way categorical table 
under nonignorable nonresponse. Our approach differs 
from those of these authors. We describe small area estimation for 
multinomial data, and we use Markov chain Monte Carlo 
methods to implement the methodology. This permits the 
inclusion of all sources of variability in our models. 

There are two approaches to modeling nonresponse. In the 
selection approach a model is specified for the hypothetical 
complete data, and a nonresponse model is added conditional 
on the hypothetical data. This approach was developed to study 
sample selection problems (e.g., Heckman 1976 and Olson 
1980). In the pattern mixture approach the respondents and 
the nonrespondents are modeled separately, and the final 
answer is obtained by a probabilistic mixture of the two. 
We use the selection approach for our problem. 

Stasny (1991) used an empirical Bayes model to study 
victimization in the National Crime Survey, and she fol- 
lowed the selection approach. This analysis pools binomial 
data from several domains, and some of them have small 
counts. Essentially this is an exercise in small area 
estimation. A related method was presented by Albert and 
Gupta (1985), who used an approximation to obtain a 
Bayesian approach for a population with a single domain 
(see also Kaufman and King 1973). That is, unlike Stasny 
(1991), these latter authors did not perform small area 
estimation, and their analysis in a single domain does not use 
data from other domains. 

Since the Bayesian approach can incorporate other 
information about nonrespondents, the Bayesian method is 
appropriate for the analysis of nonignorable nonresponse 
(Little and Rubin 1987 and Rubin 1987). However the main 
difficulty is how to describe the relationship between the 
respondents and nonrespondents. Using the selection 
approach within the framework of Bayes empirical Bayes 
(see Deely and Lindley 1981), Stasny (1991) estimated the 
hyper-parameters by maximum likelihood methods and then 
assumed them known, thereby suppressing some variability. 
We extend this approach in two directions. 

First, we consider multinomial data obtained 
independently from several geographical areas. It is worth noting 
that Basu and Pereira (1982) considered multinomial non- 
response data from a single domain using a multinomial 
Dirichlet model when the hyper-parameters are assumed 
known. Recently, Forster and Smith (1998) used graphical 
multinomial Dirichlet log-linear models to analyze data 
from the panel survey in the British general election. Again the 
hyper-parameters are assumed known, and a model with a 
single domain is used. Secondly, we obtain a full Bayesian 
approach for multinomial nonignorable nonresponse data 
from several areas. We do not estimate the hyper-para- 
meters using the data. 

As a summary, we develop a multinomial nonignorable 
nonresponse model which is used for pooling data over 
many small areas, and we note that it can be used in other 
applications. The rest of the paper is organized as follows. 
In section 2 we describe the NHANES III. In section 3 we 
discuss the Bayesian model for nonignorable nonresponse. 
In particular, a three-stage Bayesian hierarchical multi- 
nomial model is applied to the NHANES III data to investi- 
gate the nonresponse problem. In section 4 we describe an 
analysis of the NHANES III data in which we include a 
regression analysis to combine all the age-race-sex 
domains. In section 5 we describe a simulation study to 
assess the performance of our model. Finally, section 6 
presents the conclusion. 


2. NHANES III DATA AND NONRESPONSE 


The NHANES III is one of the periodic surveys used to 
assess an aspect of health of the U.S. population (National 
Center for Health Statistics 1994). Our research is 
motivated by nonresponse of body mass index (BMI) in the 
NHANES III. The data for our illustration come from this 
survey, and were collected from October 1988 to September 
1994. In section 2.1 we describe the actual data, and in 
section 2.2 we describe the data we analyze. 


2.1 NHANES III Data 


The NHANES III consists of two parts. The first part is 
the interview of the sampled individuals for their personal 
information and the second part is the examination of those 
sampled. One or more persons from the sampled house- 
holds were placed into a number of subgroups depending 
on their age, race and sex. Some subgroups were sampled 
at different rates. Sampled persons were asked to come to 
a mobile examination center (MEC) for a physical 
examination. Those who did not come were visited by the 
examiner for the same purpose. Details of the NHANES III 
sample design are available (National Center for Health 
Statistics 1992). We incorporate design features associated 
with clustering in our model. 

The main reasons for NHANES III nonresponse are “not 
interested”, “no time/work conflict”, “concerns/suspicious”, 
“don’t bother me” and “health reasons”. The nonresponse 
rate of younger individuals is very high because the parents, 
especially older mothers of an only child, were extremely 
protective of their babies, and would not allow them to 
leave their homes for the MECs. Field workers often 
observe that obese persons tend to avoid the medical 
examination. Thus nonresponse might be nonrandom and 
hence require some special attention. 

NHANES III data are adjusted by multistage ratio 
weightings for the data to be consistent with the population 
(Mohadjer, Bell and Waksberg 1994). The ratio is the 
number of persons in the sample divided by the number of 
persons who completed interview and examination. 
Weighting with nonresponse ratio is one of these stages. In 
nonresponse ratio estimation, the proportions of non- 
respondents in the multinomial cells are the same as those 
for the respondents (i.e., ignorable nonresponse). In this 
case since the proportions are of interest, no adjustment is 
required. Clearly, this ratio estimation can be incorrect 
when these two groups are different. Therefore there is a 
need to consider the adjustment by a method other than 
ratio adjustment. In this paper we investigate a Bayesian 
method as an alternative to ratio weighting for nonignorable 
nonresponse. 

NHANES III nonresponse also occurs at several levels 
in the survey: interview and examination. The interview 
nonresponse arises from sample individuals who did not 
respond to the interview. Some of those who were already 
interviewed did not come to the MEC, missing all or part of 
the examinations. In this paper, our population consists of 
those individuals who would have agreed to take the phys- 
ical examination in the MECs. Thus, nonrespondents are 
those individuals who agreed to take the physical examina- 
tion, and did not show up at the MECs. More specifically, 
since we are considering item response, the nonrespondents 
are those individuals who agreed to come to the MECs but 
whose heights and/or weights were not measured. 

Schafer, Ezzati-Rice, Johnson, Khare, Little and Rubin 
(1996) attempted a comprehensive multiple imputation 
project on the NHANES III data for many variables. The 
purpose was to impute the nonresponse data to provide 
several data sets for public use. Unfortunately, one of the 
limitations of the project was that “the procedure used to 
create missingness corresponds to a purely ignorable 
mechanism; the simulation provides no information on the 
impact of possible deviations from ignorable nonresponse.” 
Another limitation is that the procedure did not include 
geographical clustering. Our purpose is different; we do not 
provide imputed public-use data. 


2.2 Data Used for Illustration 


Our data have two age groups (younger than 45 years, 
45-, and 45 years or older, 45+), two race groups (white and 
non-white) and, of course, two groups for sex (male and 
female). Thus, there are eight age-race-sex domains. 

One of the variables of interest in the NHANES III is 
BMI, an index of weight adjusted for height (kg/m²), that 
broadly categorizes obesity within age-race-sex groups 
(Kuczmarski, Carrol, Flegal and Troiano 1997) as low body 
fat (level 1: BMI < 20), healthy body fat (level 2: 20 ≤ BMI 
< 25), hefty or unhealthy (level 3: BMI > 25). We use this 
broad classification for each of the eight age-race-sex 
groups. 
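
The three-level classification described above can be sketched as a small function. The function name and the example heights and weights are hypothetical, and assigning a BMI of exactly 25 to level 3 is an assumption, since the text only gives "at least 20 and smaller than 25" for level 2 and "greater than 25" for level 3.

```python
def bmi_level(weight_kg, height_m):
    """Return the BMI level (1, 2 or 3) for a weight in kg and height in m."""
    bmi = weight_kg / height_m ** 2
    if bmi < 20:
        return 1  # low body fat
    if bmi < 25:
        return 2  # healthy body fat
    return 3      # hefty or unhealthy (assumption: BMI of exactly 25 falls here)

# Hypothetical individuals
print(bmi_level(60, 1.80), bmi_level(70, 1.75), bmi_level(90, 1.75))
```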

Rather than a categorical data analysis, one can also 
provide an analysis that treats BMI as a continuous variable. 
While some information is lost by discretizing the BMI 
values, an analysis using continuous models for BMI will 
also be approximate and there is a need to search for an 
appropriate transformation. In the final analysis, a doctor 
only needs to know what proportions of the public belong 
to different levels of BMI, so that he or she can tell a patient’s 
standing in obesity. 

The analysis of BMI data using categorical data methods 
is not uncommon. For example, Malec, Davis and Cao 
(1999) described a Bayes empirical Bayes analysis of the 
NHANES III data. They classified an individual older than 
20 years as normal if her/his BMI is below a certain gender 
specific threshold. This is an application of a Bayesian 
analysis of binary data. However, their classification is 
somewhat restricted (see Kuczmarski et al. 1997). By 
considering multinomial data, we have generalized the 
analysis of Malec et al. (1999). In fact, they did not provide 
a nonignorable nonresponse model. 

Unlike Schafer et al. (1996), we include clustering at the 
county level, although there is a need to include clustering 
at the household level. For the complete data there are 
6,440 households. Of these households 52.1% contributed 
one person to the sample, 22.5% two persons, and 21.4% at 
least three persons. We have calculated the correlation 
coefficient for the BMI values based on pairing the 
members within households (see Rao 1973 page 199). It is 
0.19 which indicates that as a first approximation the 
clustering within households can be ignored. 

Table 1 shows the number of respondents for each BMI 
level for each age-race-sex domain and 34 counties 
(population at least 500,000). The pattern of respondents 
differs greatly by age. The nonresponse rate for the older 
group (45+) is negligible. Therefore the main concern about 
nonresponse must be given to the younger group (45-). 
There is also a higher response rate among females than 
males. We note that the selection procedure is not random 
over the single population of males and females. 


Table 1 
Number of individuals in each BMI level and number of 
nonrespondents (Non) by age, race and sex over all 34 counties 

                          BMI
Age   Race  Sex   1            2            3            Non
45-   W     M     1,098        651          597          558
            F     845          434          380          [illegible]
      B     M     1,198        [illegible]  665          574
            F     745          463          524          214
45+   W     M     46           439          1,014        3
            F     [illegible]  223          365          4
      B     M     719          470          942          8
            F     48           169          [illegible]  6

Note: BMI (1 = less than 20; 2 = at least 20 and smaller than 25; 
3 = greater than 25) 
Age (Younger than 45 years = 45-; 45 years or older = 45+) 
Race (White = W; all others = B) 
Sex (Male = M; Female = F) 


Table 2 
Number of individuals in each BMI level and number of 
nonrespondents (Non) for eight examples (Ex) of small 
age-race-sex domains from different counties 

                          BMI Level
Ex   Age   Race  Sex   1    2    3    Non
1    45-   W     M     1    3    1    14
2                F     3    4    1    0
3          B     M     5    5    6    10
4                F     3    1    1    1
5    45+   W     M     1    2    6    0
6                F     1    3    4    0
7          B     M     3    3    5    0
8                F     2    0    1    1

Note: BMI (1 = less than 20; 2 = at least 20 and smaller than 25; 
3 = greater than 25) 
Age (Younger than 45 years = 45-; 45 years or older = 45+) 
Race (White = W; all others = B) 
Sex (Male = M; Female = F) 


One important aspect of our work is on small area 
estimation. Because we consider inference for each age-race-sex 
domain separately over the geographical areas 
(counties), the samples from some of these areas can be 
very small. Thus, small area estimation techniques are 
required to estimate the parameters corresponding to these 
smaller areas. Specifically, we need to “borrow strength” 
from the larger areas to make the estimates for the smaller 
areas more reliable. Table 2 presents eight examples to 
show the need for small area techniques. We have selected 
eight counties that have small domains; all the cell counts 
are at most 6 and many of them are as small as 1 (one of 
them is 0 for 45+). We will present overall estimates and 
the estimates for the first four examples (45-). Note that in 
comparison to the cell counts, the nonrespondents are large 
for two of them (14 and 10 nonrespondents). 

We note that the purpose is not a comprehensive analysis 
of the NHANES III data although it forms an approximate 
analysis for these data. Our method is general enough to 
analyze multinomial nonresponse data from many areas, 
some of which can be small. It is for these small areas that 
we develop this modeling technique. Thus, in this paper we 
use the NHANES III data to illustrate our method. 

Our method considers each domain separately with a 
“borrowing of strength” across the 34 areas (counties) to 
analyze the BMI data. Thus, there are eight separate 
analyses, each with 34 areas, and some of them are small. 
We use a hierarchical multinomial nonresponse model to 
analyze data of this form. The small cell counts, substantial 
nonrespondents and multinomial data make the methodo- 
logy much more practical. Our methodology is also 
extended to incorporate all the domains simultaneously 
through logistic models. 


3. METHODOLOGY FOR HIERARCHICAL 
MULTINOMIAL MODEL 


We propose a model for each of the eight age-race-sex 
domains but for all counties taken simultaneously. How- 
ever, the models fall into two broad classes. We will use a 
nonignorable nonresponse model for the younger group and 
an ignorable nonresponse model for the older group since 
the nonresponse rate for the older group is negligible. Of 
course, it is worthwhile to compare the ignorable non- 
response model and the nonignorable nonresponse model 
for the younger group. We will show how to combine the 
groups later using logistic regression, although this is not 
the key issue of this paper. 

For each age-race-sex group, the $k$-th individual in the $i$-th 
county belongs to one of $J$ BMI levels. Then for the $k$-th 
individual in the $i$-th county, the characteristic variable at the 
$j$-th BMI level is defined as follows, 

$$x_{ik} = (x_{i1k}, \ldots, x_{iJk})', \quad k = 1, \ldots, n_i, \; i = 1, \ldots, c,$$

where each $x_{ijk} = 0$ or $1$, $j = 1, \ldots, J$, and $\sum_{j=1}^{J} x_{ijk} = 1$. 

The response variable $y_{ijk}$ is defined for each age-race-sex 
domain as 

$$y_{ijk} = \begin{cases} 1, & \text{if individual } k \text{ belonging to BMI level } j \text{ in county } i \text{ responded} \\ 0, & \text{if individual } k \text{ belonging to BMI level } j \text{ in county } i \text{ did not respond.} \end{cases}$$

We use a probabilistic structure to model the $x_{ik}$ and $y_{ijk}$. 
In our application, there are $c = 34$ counties and $J = 3$ BMI 
levels. 


3.1. Ignorable and Nonignorable Nonresponse 
Models 


For both the ignorable and the nonignorable nonresponse 
models, we have 

$$x_{ik} \mid p_i \overset{iid}{\sim} \text{Multinomial}(1, p_i), \tag{1}$$

where $p_i = (p_{i1}, \ldots, p_{iJ})'$ and $p_{ij}$ is the probability that an 
individual in the $i$-th county belongs to the $j$-th BMI level. 
Next, we describe the remaining portions of the ignorable and 
the nonignorable models. 

First, we describe the ignorable nonresponse model. Let $\pi_i$ 
denote the probability that an individual within the $i$-th 
county responds (i.e., the probability of responding depends 
only on the county). Then, we assume that 

$$y_{ijk} \mid \pi_i \overset{iid}{\sim} \text{Bernoulli}(\pi_i). \tag{2}$$

At the second stage, letting $\mu_1 = (\mu_{11}, \mu_{12}, \ldots, \mu_{1J})'$, we take 

$$p_i \mid \mu_1, \tau_1 \overset{iid}{\sim} \text{Dirichlet}(\mu_1 \tau_1), \tag{3}$$

$$\pi_i \mid \mu_{21}, \tau_{21} \overset{iid}{\sim} \text{Beta}(\mu_{21}\tau_{21}, (1 - \mu_{21})\tau_{21}), \tag{4}$$

where 

$$f(p_i \mid \mu_1, \tau_1) = \prod_{j=1}^{J} p_{ij}^{\mu_{1j}\tau_1 - 1} \Big/ D(\mu_1 \tau_1), \quad 0 \le p_{ij} \le 1, \; \sum_{j=1}^{J} p_{ij} = 1,$$

and 

$$D(\mu_1 \tau_1) = \prod_{j=1}^{J} \Gamma(\mu_{1j}\tau_1) \Big/ \Gamma(\tau_1), \quad 0 < \mu_{1j} < 1, \; \sum_{j=1}^{J} \mu_{1j} = 1.$$

The components of $\mu_1$ are the prior means of the 
corresponding components of the $p_i$, and $\tau_1$ can be interpreted as 
a prior sample size. Similar interpretations can be given for $\mu_{21}$ 
and $\tau_{21}$ for $\pi_i$. Thus, assumption (3) expresses similarity 
among the cell proportions $p_i$ and (4) expresses similarity 
among the response probabilities $\pi_i$. It is this structure that 
causes the “borrowing of strength” across the $c$ counties. 
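
To make the first two stages of the ignorable model concrete, the following sketch simulates one county from (1)-(4). It is only an illustration: the function names and the hyper-parameter values are hypothetical, and the Dirichlet draw is obtained by normalizing independent gamma variates.

```python
import random

def draw_dirichlet(alpha, rng):
    """Draw from a Dirichlet(alpha) distribution via normalized gammas."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def simulate_county(n, mu1, tau1, mu21, tau21, rng):
    """Simulate n individuals in one county under the ignorable model.
    Returns (y, nonrespondents), where y[j] counts respondents at level j+1."""
    p = draw_dirichlet([m * tau1 for m in mu1], rng)           # stage 2, (3)
    pi = rng.betavariate(mu21 * tau21, (1 - mu21) * tau21)     # stage 2, (4)
    y = [0] * len(mu1)
    nonrespondents = 0
    for _ in range(n):
        level = rng.choices(range(len(p)), weights=p)[0]       # stage 1, (1)
        if rng.random() < pi:                                  # stage 1, (2)
            y[level] += 1
        else:
            nonrespondents += 1
    return y, nonrespondents

rng = random.Random(1)
# Hypothetical hyper-parameters: prior means mu1, prior "sample sizes" tau1, tau21
print(simulate_county(50, [0.4, 0.3, 0.3], 10.0, 0.8, 20.0, rng))
```

Larger values of `tau1` and `tau21` pull the county-level draws closer to the prior means, which is the sense in which the counties share information.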

Second, we describe the nonignorable nonresponse 
model. Let $\pi_{ij}$ denote the probability that an individual 
within the $i$-th county responds in the $j$-th BMI level (i.e., the 
probability of responding depends not only on the county 
but also on the BMI level). Then, we assume that 

$$y_{ijk} \mid \{x_{ik} = (x_{i1k}, \ldots, x_{iJk})', \pi_{ij}\} \overset{iid}{\sim} \text{Bernoulli}(\pi_{ij}), \tag{5}$$

where $x_{ijk} = 1$, $x_{ij'k} = 0$, $j' \ne j$, for $j = 1, \ldots, J$, 
$k = 1, \ldots, n_i$, $i = 1, \ldots, c$. 
Letting $\mu_3 = (\mu_{31}, \mu_{32}, \ldots, \mu_{3J})'$, at the second stage we also 
take 

$$p_i \mid \mu_3, \tau_3 \overset{iid}{\sim} \text{Dirichlet}(\mu_3 \tau_3) \tag{6}$$

and 

$$\pi_{ij} \mid \mu_{4j}, \tau_{4j} \overset{iid}{\sim} \text{Beta}(\mu_{4j}\tau_{4j}, (1 - \mu_{4j})\tau_{4j}), \quad j = 1, \ldots, J. \tag{7}$$

Like the assumptions in (3) and (4), the assumptions in 
(6) and (7) express similarity among the counties. We note 
that the response parameters $\pi_{ij}$ are weakly identifiable 
(i.e., unreliable estimates). However, the selection model 
is to our advantage, because the joint density of $x_{ik}$ and 
$y_{ik} = (y_{i1k}, \ldots, y_{iJk})'$ connects the $p_{ij}$ and $\pi_{ij}$. In fact, this 
is an advantage over the pattern mixture approach. 

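To ensure a full Bayesian analysis, at the third stage we 
take the prior densities for the hyper-parameters as follows. 
For the ignorable nonresponse model, the prior densities are 

$$\mu_1 \sim \text{Dirichlet}(1, 1, \ldots, 1), \quad \mu_{21} \sim \text{Beta}(1, 1),$$

$$\tau_1 \sim \text{Gamma}(\eta_1^{(0)}, \nu_1^{(0)}) \quad \text{and} \quad \tau_{21} \sim \text{Gamma}(\eta_{21}^{(0)}, \nu_{21}^{(0)}),$$

where (letting $t$ denote either $\tau_1$ or $\tau_{21}$, $a$ either $\eta_1^{(0)}$ or $\eta_{21}^{(0)}$, 
and $b$ either $\nu_1^{(0)}$ or $\nu_{21}^{(0)}$) $t \sim \text{Gamma}(a, b)$ means that 
$f(t) = b^a t^{a-1} e^{-bt} / \Gamma(a)$, $t > 0$, and $f(t) = 0$ otherwise. 
The hyper-parameters $\eta_1^{(0)}$, $\nu_1^{(0)}$, $\eta_{21}^{(0)}$ and $\nu_{21}^{(0)}$ are to be 
specified. The corresponding part of the nonignorable 
nonresponse model is 

$$\mu_3 \sim \text{Dirichlet}(1, 1, \ldots, 1), \quad \mu_{4j} \sim \text{Beta}(1, 1),$$

$$\tau_3 \sim \text{Gamma}(\eta_3^{(0)}, \nu_3^{(0)}) \quad \text{and} \quad \tau_{4j} \sim \text{Gamma}(\eta_{4j}^{(0)}, \nu_{4j}^{(0)}), \quad j = 1, \ldots, J.$$

Again, the hyper-parameters $\eta_3^{(0)}$, $\nu_3^{(0)}$, $\eta_{4j}^{(0)}$, $\nu_{4j}^{(0)}$, $j = 1, \ldots, J$, 
are to be specified. It is possible to use other prior densities 
such as shrinkage priors, but it is likely that these will 
provide similar inference as our sensitivity analysis 
indicates in section 4. 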

It is an attractive property of the hierarchical model that 
it introduces correlation among the variables. For example, 
in our application (1), (2), (3) and (4) make the $(x_{ik}, y_{ik})$ 
equi-correlated across the individuals within the $i$-th area. 
This is the clustering effect within the areas. Such an effect 
can be obtained directly, but it will not be as simple as in a 
hierarchical model. A further benefit of the hierarchical 
model is that it takes care of extraneous variations among 
the areas, and this effect can be obtained directly by using 
a random effects model. But in our case, this will lose the 
natural multinomial data structure. 

Let $r_i$ be the number of respondents in county $i$ and $y_{ij}$ 
the number of respondents having the $j$-th BMI level in the $i$-th 
county. Then $r_i$ and $y_{ij}$ are random variables; $n_i - r_i$ is the 
number of nonrespondents. Since the number of 
nonrespondents at the $j$-th BMI level is unknown, we denote 
them by the latent variables $z_{ij}$ (see the tree diagram in 
Figure 1). If we can tell what the $z_{ij}$ are, our nonresponse 
problem will be solved. Of course, under the assumption of 
ignorable nonresponse, they can be estimated easily using 
ratio estimation. The $z_{ij}$ are useful because under the 
assumption of nonignorable nonresponse they simplify the 
sampling based method to obtain estimates of the 
parameters of interest. 

[Figure 1: tree diagram with leaf nodes $y_{i1}$, $y_{i2}$, $y_{i3}$ (respondents) and $z_{i1}$, $z_{i2}$, $z_{i3}$ (nonrespondents)] 

Figure 1. Latent nonignorable response tree diagram. From a sample 
of $n_i$ individuals, there are $r_i$ respondents of which $y_{ij}$ 
belong to category $j$, $j = 1, 2, 3$. Among the $(n_i - r_i)$ 
nonrespondents $z_{ij}$ individuals belong to category $j$, where 
$z_{ij}$ are latent variables. 
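
The remark that the latent counts can be estimated by ratio estimation under ignorable nonresponse can be illustrated with example 3 of Table 2 (5, 5 and 6 respondents in the three BMI levels, 10 nonrespondents). The function name below is hypothetical; the sketch simply spreads the nonrespondents over the levels in proportion to the respondents.

```python
def ratio_allocate(y, n_nonrespondents):
    """Allocate nonrespondents to BMI levels in proportion to the
    respondent counts y (ignorable nonresponse assumption)."""
    r = sum(y)
    return [n_nonrespondents * y_j / r for y_j in y]

# Example 3 of Table 2: y = (5, 5, 6), 10 nonrespondents
z = ratio_allocate([5, 5, 6], 10)
print(z)  # [3.125, 3.125, 3.75]
```

Under nonignorable nonresponse this proportional allocation is no longer justified, which is why the latent counts are treated as unknowns in the model.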


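The likelihood function for the ignorable nonresponse 
model is 

$$f(y, r \mid p, \pi) = \prod_{i=1}^{c} \binom{n_i}{r_i} \pi_i^{r_i} (1 - \pi_i)^{n_i - r_i} \times \prod_{i=1}^{c} \frac{r_i!}{y_{i1}! \cdots y_{iJ}!} \prod_{j=1}^{J} p_{ij}^{y_{ij}}.$$

Here the likelihood function has two distinct parts, one for the $p_i$ 
and the other for the $\pi_i$. Using Bayes’ theorem the joint 
posterior density of all the parameters is 

$$f(p, \pi, \mu_1, \tau_1, \mu_{21}, \tau_{21} \mid y, r) \propto \prod_{i=1}^{c} \left\{ \pi_i^{r_i} (1 - \pi_i)^{n_i - r_i} \prod_{j=1}^{J} p_{ij}^{y_{ij}} \times \frac{\prod_{j=1}^{J} p_{ij}^{\mu_{1j}\tau_1 - 1}}{D(\mu_1 \tau_1)} \times \frac{\pi_i^{\mu_{21}\tau_{21} - 1} (1 - \pi_i)^{(1 - \mu_{21})\tau_{21} - 1}}{B(\mu_{21}\tau_{21}, (1 - \mu_{21})\tau_{21})} \right\} \times \tau_1^{\eta_1^{(0)} - 1} \exp(-\nu_1^{(0)} \tau_1) \, \tau_{21}^{\eta_{21}^{(0)} - 1} \exp(-\nu_{21}^{(0)} \tau_{21}). \tag{8}$$

Similarly, the augmented likelihood function (i.e., 
including the $z_{ij}$) for the nonignorable nonresponse model 
is 

$$f(y, r, z \mid p, \pi) = \prod_{i=1}^{c} \left[ \binom{n_i}{r_i} \binom{r_i}{y_{i1}, \ldots, y_{iJ}} \binom{n_i - r_i}{z_{i1}, \ldots, z_{iJ}} \prod_{j=1}^{J} \left\{ (\pi_{ij} p_{ij})^{y_{ij}} ((1 - \pi_{ij}) p_{ij})^{z_{ij}} \right\} \right],$$

and using Bayes’ theorem the joint posterior density of all 
the parameters is 

$$f(p, \pi, z, \mu_3, \tau_3, \mu_4, \tau_4 \mid y, r) \propto \prod_{i=1}^{c} \left\{ \binom{n_i - r_i}{z_{i1}, \ldots, z_{iJ}} \prod_{j=1}^{J} (\pi_{ij} p_{ij})^{y_{ij}} ((1 - \pi_{ij}) p_{ij})^{z_{ij}} \times \frac{\prod_{j=1}^{J} p_{ij}^{\mu_{3j}\tau_3 - 1}}{D(\mu_3 \tau_3)} \times \prod_{j=1}^{J} \frac{\pi_{ij}^{\mu_{4j}\tau_{4j} - 1} (1 - \pi_{ij})^{(1 - \mu_{4j})\tau_{4j} - 1}}{B(\mu_{4j}\tau_{4j}, (1 - \mu_{4j})\tau_{4j})} \right\} \times \tau_3^{\eta_3^{(0)} - 1} \exp(-\nu_3^{(0)} \tau_3) \prod_{j=1}^{J} \tau_{4j}^{\eta_{4j}^{(0)} - 1} \exp(-\nu_{4j}^{(0)} \tau_{4j}). \tag{9}$$
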

We consider inference about the $p_{ij}$, the proportion of 
individuals at the $j$-th BMI level in the $i$-th county, and the 
probability of responding, 

$$\delta_i = \sum_{j=1}^{J} \pi_{ij} p_{ij}, \quad i = 1, \ldots, c.$$

However, the joint posterior densities in (8) and (9) are 
complex, and cannot be used to make inference 
analytically. Thus, we use a Markov chain Monte Carlo algorithm 
to obtain estimates of the posterior distribution of the 
parameters. Our method is to use a Metropolis-Hastings (MH) 
sampler to get samples from (8) and (9) and then to use 
these samples to make posterior inferences about $p_i$ and $\delta_i$. 


3.2 Computations 


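For the ignorable nonresponse model, it is convenient to 
represent the posterior density function as 

$$f(p, \pi, \mu_1, \tau_1, \mu_{21}, \tau_{21} \mid y, r) = \left[ \prod_{i=1}^{c} \left\{ f_1(p_i \mid y_i, \mu_1, \tau_1) \, f_2(\pi_i \mid r_i, \mu_{21}, \tau_{21}) \right\} \right] \times f_3(\mu_1, \tau_1, \mu_{21}, \tau_{21} \mid y, r),$$

where $f_1(\cdot)$ is a Dirichlet density, 

$$p_i \mid y_i, \mu_1, \tau_1 \overset{ind}{\sim} \text{Dirichlet}(y_i + \mu_1 \tau_1),$$

$f_2(\cdot)$ is a beta density, 

$$\pi_i \mid r_i, \mu_{21}, \tau_{21} \overset{ind}{\sim} \text{Beta}(r_i + \mu_{21}\tau_{21}, \; n_i - r_i + (1 - \mu_{21})\tau_{21}),$$

and 

$$f_3(\mu_1, \tau_1, \mu_{21}, \tau_{21} \mid y, r) \propto \prod_{i=1}^{c} \left\{ \frac{D(y_i + \mu_1 \tau_1)}{D(\mu_1 \tau_1)} \right\} p(\mu_1, \tau_1) \times \prod_{i=1}^{c} \left\{ \frac{B(r_i + \mu_{21}\tau_{21}, \; n_i - r_i + (1 - \mu_{21})\tau_{21})}{B(\mu_{21}\tau_{21}, (1 - \mu_{21})\tau_{21})} \right\} p(\mu_{21}, \tau_{21}),$$

with $p(\mu_1, \tau_1)$ and $p(\mu_{21}, \tau_{21})$ the prior distributions. 
Hence, $f_1$ and $f_2$ are obtained through the Gibbs kernel, 
while for $f_3$ we use the MH algorithm (Nandram 1998). 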
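For the nonignorable nonresponse model, it is 
convenient to represent the posterior density function as 

$$f(p, \pi, z, \mu_3, \tau_3, \mu_4, \tau_4 \mid y, r) = \prod_{i=1}^{c} \left[ \left\{ \prod_{j=1}^{J} f_j(\pi_{ij} \mid y_{ij}, z_{ij}, \mu_{4j}, \tau_{4j}) \right\} f_{J+1}(p_i \mid y_i, z_i, \mu_3, \tau_3) \right] \times f_{J+2}(\mu_3, \tau_3, \mu_4, \tau_4, z \mid y, r),$$

where $f_1(\cdot), \ldots, f_J(\cdot)$ are beta densities, 

$$\pi_{ij} \mid y_{ij}, z_{ij}, \mu_{4j}, \tau_{4j} \overset{ind}{\sim} \text{Beta}(y_{ij} + \mu_{4j}\tau_{4j}, \; z_{ij} + (1 - \mu_{4j})\tau_{4j}),$$

$f_{J+1}(\cdot)$ is a Dirichlet density, 

$$p_i \mid y_i, z_i, \mu_3, \tau_3 \overset{ind}{\sim} \text{Dirichlet}(y_i + z_i + \mu_3 \tau_3),$$

and $f_{J+2}(\cdot)$ is given by 

$$f_{J+2}(\mu_3, \tau_3, \mu_4, \tau_4, z \mid y, r) \propto \prod_{i=1}^{c} \left[ \binom{n_i - r_i}{z_{i1}, \ldots, z_{iJ}} \left\{ \frac{D(y_i + z_i + \mu_3 \tau_3)}{D(\mu_3 \tau_3)} \right\} \prod_{j=1}^{J} \frac{B(y_{ij} + \mu_{4j}\tau_{4j}, \; z_{ij} + (1 - \mu_{4j})\tau_{4j})}{B(\mu_{4j}\tau_{4j}, (1 - \mu_{4j})\tau_{4j})} \right] p(\mu_3, \tau_3) \, p(\mu_4, \tau_4),$$

with $p(\mu_3, \tau_3)$ and $p(\mu_4, \tau_4)$ the prior distributions. Thus, 
$f_1, \ldots, f_{J+1}$ are obtained through the Gibbs kernel, while 
$f_{J+2}$ is obtained using the MH algorithm (Nandram 1998). 
We obtain the latent variables $z_{ij}$ through one of the 
conditional posterior densities of the MH algorithm. A sketch of 
the procedure is given in Appendix 1. 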
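
The conditional structure above can be illustrated with a simplified sampler for one county. This is not the algorithm of the paper: here the hyper-parameters are held fixed at hypothetical values (so no Metropolis-Hastings step is needed), and the latent counts are drawn from their full conditional given the proportions and response probabilities, which under the augmented likelihood is multinomial with cell probabilities proportional to $(1 - \pi_{ij}) p_{ij}$. All function and variable names are assumptions made for the example.

```python
import random

def draw_dirichlet(alpha, rng):
    """Draw from a Dirichlet(alpha) distribution via normalized gammas."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def gibbs_sweep(y, n_nonresp, z, mu3, tau3, mu4, tau4, rng):
    """One sweep for a single county with fixed hyper-parameters.
    y: respondent counts per level; z: current latent nonrespondent counts."""
    J = len(y)
    # pi_ij | y_ij, z_ij ~ Beta(y_ij + mu4j*tau4j, z_ij + (1 - mu4j)*tau4j)
    pi = [rng.betavariate(y[j] + mu4[j] * tau4[j],
                          z[j] + (1 - mu4[j]) * tau4[j]) for j in range(J)]
    # p_i | y_i, z_i ~ Dirichlet(y_i + z_i + mu3*tau3)
    p = draw_dirichlet([y[j] + z[j] + mu3[j] * tau3 for j in range(J)], rng)
    # z_i | p_i, pi_i ~ Multinomial(n_i - r_i, q), q_j proportional to (1 - pi_ij) p_ij
    weights = [(1 - pi[j]) * p[j] for j in range(J)]
    z_new = [0] * J
    for level in rng.choices(range(J), weights=weights, k=n_nonresp):
        z_new[level] += 1
    return pi, p, z_new

rng = random.Random(2)
y, n_nonresp = [5, 5, 6], 10          # example 3 of Table 2
z = [4, 3, 3]                         # arbitrary starting allocation
mu3, tau3 = [1 / 3, 1 / 3, 1 / 3], 3.0          # hypothetical values
mu4, tau4 = [0.6, 0.6, 0.6], [2.0, 2.0, 2.0]    # hypothetical values
for _ in range(100):
    pi, p, z = gibbs_sweep(y, n_nonresp, z, mu3, tau3, mu4, tau4, rng)
print(z, [round(v, 3) for v in p])
```

In a full analysis the hyper-parameters would also be updated at each iteration and the counties would be processed together, so that the small counties borrow strength from the larger ones.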

We drew 5,500 iterates, threw out the first 500, and kept
every fifth iterate thereafter (a choice guided by trace plots).
This strategy was satisfactory to wash out the autocorrelation
among the iterates and to have good jumping probabilities
(0.25-0.50) for the Metropolis steps. For the computation,
first we set the hyper-parameters of the gamma priors, the
$\eta^{(0)}$ and $\nu^{(0)}$ (indexed by $j = 1, \ldots, J$ where applicable),
equal to 0. Then we ran our MH algorithm to obtain posterior
samples of the $\tau$ parameters. To ensure proper posterior
distributions, we estimated the $\eta$ and $\nu$ by fitting gamma
densities to these posterior samples. These values are shown
in Table 3. Finally, with these proper priors we ran our
algorithm to obtain posterior samples. Specifically, we
obtained $M = 1,000$ iterates, indexed by $h = 1, \ldots, M$, for
each county $i = 1, \ldots, c$. Inference about the $p_i$, the
$\delta_i$ and any function of them can be made using these
iterates in a straightforward manner.
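The burn-in and thinning arithmetic described above (5,500 iterates, the first 500 discarded, every fifth kept, giving M = 1,000) can be illustrated with a minimal sketch. The autoregressive series used here is only a hypothetical stand-in for MH output, and the function names are ours, not part of the original implementation.

```python
import numpy as np

def burn_and_thin(chain, burn_in=500, thin=5):
    """Drop the first `burn_in` iterates, then keep every `thin`-th one."""
    return np.asarray(chain)[burn_in::thin]

def lag1_autocorrelation(x):
    """Sample lag-1 autocorrelation of a one-dimensional series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

# Hypothetical stand-in for sampler output: an AR(1) series with
# strong serial correlation (assumption, for illustration only).
rng = np.random.default_rng(0)
raw = np.empty(5500)
raw[0] = 0.0
for t in range(1, raw.size):
    raw[t] = 0.8 * raw[t - 1] + rng.normal()

kept = burn_and_thin(raw)
print(len(kept), lag1_autocorrelation(raw), lag1_autocorrelation(kept))
```

Thinning reduces the serial correlation of the retained iterates at the cost of discarding draws; the thinning interval itself would in practice be chosen from trace plots, as described in the text.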


Table 3 
Estimates of 7 and v corresponding to the gamma densities 
on T,,T,, for 45+ and t,,'T), t55 Ta for 45- by race and sex 


Age 
45- 454+ 
Race Sex T; Fe i ure < cs 
Ww M_ © 3.698 2.341 3.085 2.685 4.408 3.941 


vy 036 .071 .201 .163 .009 .052 

Fo” 1.819 4.788 4.384 

yom “030-059 072 2017 008 a2 O19 

B M_ 1 4.948 2.922 3.156 2.404 5.971 4.376 
vy 068 .096 .169 .147 .107 .036 

F © 3.745 3.084 1.893 2.350 3.292 4.488 

vy 055 .036 .049 .116 .009 .036 


4. AN ANALYSIS OF THE NHANES III DATA


In this section we illustrate our methodology using the
BMI data from NHANES III. First, we study our estimates
based on summary measures over the counties. Specifically,
we use the weighted posterior distributions of the $p_{ij}$,

$$\bar{p}_j = \sum_{i=1}^{c} n_i p_{ij} \Big/ \sum_{i=1}^{c} n_i, \quad j = 1, 2, 3,$$

and the weighted posterior distribution of the $\delta_i$,
for each of the eight age-race-sex domains. Then, for the
first four examples in Table 2 we show small area effects.

We also show how to relate the $p_{ij}$ and the $\pi_{ij}$ to age,
race and sex using linear and nonlinear logistic regression
models.


4.1 Data Analysis 


First, we performed a sensitivity analysis to assess the
specifications of $\eta^{(0)}$ and $\nu^{(0)}$. We compared three choices
of the hyper-parameters $\Omega = (\eta, \nu)$ to check the sensitivity
of inference to their specification. Our first choice is four
times $\Omega$, i.e., $4\Omega = (4\eta, 4\nu)$; our second choice is the
hyper-parameters without any change, i.e., $\Omega = (\eta, \nu)$;
and our third choice is one fourth of $\Omega$, i.e.,
$\Omega/4 = (\eta/4, \nu/4)$.

Table 4 shows the simulation results for the sensitivity of
inference about the $\bar{p}_j$ for the younger group (45-). The
point estimates and standard deviations of the proportions are
very similar over the three choices of hyper-parameters.
Similarly, Table 5 shows the simulation results for the $\bar{p}_j$ for
the older group (45+). The point estimates for males are
very similar over the three choices of the hyper-parameters,
but there are small changes in the point estimates for
females from $4\Omega$ to $\Omega$. The standard deviations
increase when $\Omega$ decreases for the females, but no
substantial changes are detected for males. Generally, the
nonignorable nonresponse model performs better than the
ignorable nonresponse model, as the nonignorable nonresponse
model is not sensitive to choices of the hyper-parameters.


Table 4
Sensitivity of $\bar{p}_j$ to the choice of $\eta_3^{(0)}, \nu_3^{(0)}$ and $\eta_{4j}^{(0)}, \nu_{4j}^{(0)}$, $j = 1, \ldots, J$,
for the younger group (45-) for the three BMI levels

Race  Sex   $\bar{p}_1$   std($\bar{p}_1$)   $\bar{p}_2$          std($\bar{p}_2$)   $\bar{p}_3$          std($\bar{p}_3$)
(a) $4\Omega$
W     M     .428   .022     .216          .019     .356          .022
      F     .476   .025     .232          .020     .292          .024
B     M     .419   .020     [illegible]   .016     .369          .020
      F     .434   .026     .185          .023     .381          .027
(b) $\Omega$
W     M     .427   .022     [illegible]   .020     .362          .025
      F     .476   .026     .223          .024     .301          .031
B     M     .419   .020     .208          .017     [illegible]   .022
      F     .435   .025     .178          .026     .387          .029
(c) $\Omega/4$
W     M     .427   .022     .210          .021     .364          .027
      F     .475   .026     .220          .026     .304          .034
B     M     .419   .020     .206          .018     .375          .024
      F     .435   .025     [illegible]   .028     .388          .029

Note 1: $\Omega = (\eta_3^{(0)}, \nu_3^{(0)}, \eta_{41}^{(0)}, \nu_{41}^{(0)}, \eta_{42}^{(0)}, \nu_{42}^{(0)}, \eta_{43}^{(0)}, \nu_{43}^{(0)})$.

Note 2: The nonignorable nonresponse model is applied to the
younger group.


Table 5
Sensitivity of $\bar{p}_j$ to the choice of the hyper-parameters $\Omega$ for the
older group (45+) for the three BMI levels

Race  Sex   $\bar{p}_1$   std($\bar{p}_1$)   $\bar{p}_2$          std($\bar{p}_2$)   $\bar{p}_3$   std($\bar{p}_3$)
(a) $4\Omega$
W     M     .030   .005     .306          .018     .664   .018
      F     .081   .002     .436          .004     .483   .004
B     M     .053   .011     [illegible]   .017     .630   .018
      F     .075   .005     .201          .004     .724   .006
(b) $\Omega$
W     M     .031   .005     .292          .016     .677   .016
      F     .063   .002     .443          .006     .494   .005
B     M     .053   .011     .316          .019     .631   .020
      F     .066   .012     [illegible]   .018     .697   .019
(c) $\Omega/4$
W     M     .031   .005     .293          .018     .676   .019
      F     .073   .015     .359          .011     .568   .019
B     M     .053   .010     .317          .018     .630   .019
      F     .065   .013     .221          .022     .714   .025

Note 1: $\Omega = (\eta_1^{(0)}, \nu_1^{(0)}, \eta_2^{(0)}, \nu_2^{(0)})$.
Note 2: The ignorable nonresponse model is applied to the older
group.

Table 6
Point estimates and 95% credible intervals for the weighted probability of response, $\bar{\delta} = \sum_{i=1}^{c} n_i \delta_i / \sum_{i=1}^{c} n_i$,
for three choices of $\Omega$ and the younger group


4 Q Q/4 
Race Sex 6 std( 5) Interval 5 std(6 ) Interval 5 std( 5) Interval 
WM 775, 016, G744...805) 769. 017. C735. 801). 767 .018 (.732, .799) 
FY 3802 OL 9, CS 20060) 6.55) O20 ColOeS si) mes oS Ad ph: (.806, .887) 
BM VI8SOS.01G “C752, 5517) "780" .01SPieG40. 815) Ba778 .018 (.739, .811) 
F 2880 201303 €854,0902) ©2878) 0154845. O03) anERT6 .015 (.838, .903) 


Note: See the note to Table 1. 


Table 6 shows point estimates of the probability of
responding, $\bar{\delta}$, and their 95% credible intervals for three
choices of $\Omega$. The probabilities of responding for males are
lower than those for females, and this pattern remains the
same for the three choices of $\Omega$. If a similar survey is
conducted in the future, we should increase the sample size
by a factor of 1.30 (= 1/.769) for white males and 1.17 (= 1/.855)
for white females (e.g., if complete data are required
from 1,000 households, the interviewer needs to contact
1,300 white males).
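The inflation factors quoted above are simply reciprocals of the estimated response probabilities. A minimal sketch of that arithmetic follows; the function name is ours, and the two probabilities are the values .769 and .855 used in the text.

```python
def inflation_factor(response_prob):
    """Factor by which the contacted sample must exceed the target number of completes."""
    return 1.0 / response_prob

# Response probabilities quoted in the text for white males and white females.
factor_wm = round(inflation_factor(0.769), 2)
factor_wf = round(inflation_factor(0.855), 2)

# Contacts needed for 1,000 completed interviews, using the rounded factor.
contacts_wm = round(1000 * factor_wm)
print(factor_wm, factor_wf, contacts_wm)
```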

In Table 7 we present 95% credible intervals for the $\bar{p}_j$
for the three BMI levels. For the younger group, $\bar{p}_1$ of
BMI level 1 is the highest, and $\bar{p}_2$ of BMI level 2 is the
lowest. The lower bounds for $\bar{p}_1$ and $\bar{p}_3$ are similar for the
younger group except for white females, and those for $\bar{p}_2$
are similar except for the non-white females. For the older
group, $\bar{p}_3$ of BMI level 3 is highest, and $\bar{p}_1$ of BMI level
1 is lowest. Specifically, $\bar{p}_2$ and $\bar{p}_3$ are high and $\bar{p}_1$ is low for
the white males.


Table 7
95% credible intervals for the weighted proportions,
$\bar{p}_j = \sum_{i=1}^{c} n_i p_{ij} / \sum_{i=1}^{c} n_i$, by age, race and sex

                     95% credible interval
Age   Race  Sex   $\bar{p}_1$                    $\bar{p}_2$             $\bar{p}_3$
45-   W     M     (.382, .470)           (.174, .252)    (.314, .412)
            F     (.425, [illegible])    [illegible]     (.243, .371)
      B     M     (.381, .455)           (.176, .241)    (.333, .419)
            F     (.385, .482)           (.130, .230)    (.329, .442)
45+   W     M     (.022, .041)           (.255, .326)    (.643, .710)
            F     (.059, .068)           (.431, .451)    (.486, .505)
      B     M     (.035, .076)           (.282, .352)    (.592, .670)
            F     (.040, .093)           (.206, .265)    ([illegible], .731)

Note 1: The nonignorable nonresponse model is applied to the
younger group.
Note 2: The ignorable nonresponse model is applied to the older
group.


As suggested by a referee, we have looked at the results
for older white females (45+) in Table 7 in greater detail.
From Table 1 the observed proportions in the three BMI
levels are .079, .347 and .568. However, the 95% credible
intervals for the population proportions in Table 7 are
(.059, .068), (.431, .451) and (.486, .505) respectively. That
is, while the observed proportions are close to the intervals,
none of these intervals contains the corresponding observed
proportion. We can explain this phenomenon in the following
manner. The data for older white females (45+) are very
sparse. For the 34 counties the quartiles of the observed
counts in the three BMI levels are (0, 1, 3), (3, 6, 10) and
(5, 9, 14) respectively. Thus, when the ignorable nonresponse
model is fit to the 34 counties, there is shrinkage not only
across the counties but also across the BMI levels.
Consequently, the largest proportion tends to be smaller and
the smallest proportion tends to be larger, and since the three
proportions must add up to one, the second proportion must
also “shrink” somewhat. In addition, consider the sensitivity
analysis in Table 5. We can approximate 95% credible
intervals for $\bar{p}_1$, $\bar{p}_2$ and $\bar{p}_3$ by using the posterior mean
± 2 standard deviations. The intervals at $4\Omega$ and $\Omega$ do not
contain the observed proportions, but the intervals at $\Omega/4$
do. Therefore, because of the sparseness of the data, inference
for older white females (45+) is somewhat sensitive to
misspecification of the prior through $\Omega$. These
results are expected within the small area context, when
there are sparse data.
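The approximate intervals mentioned above (posterior mean ± 2 standard deviations) can be checked directly from the white-female rows of Table 5 for the older group. The sketch below is illustrative only; the dictionary keys and function names are ours, and it asks whether all three observed proportions from Table 1 fall inside their approximate intervals under each choice of the hyper-parameters.

```python
# Observed proportions for older white females (from Table 1, as quoted in the text).
observed = (0.079, 0.347, 0.568)

# (posterior mean, posterior std) for the three BMI levels, white females 45+, Table 5.
table5_white_females = {
    "4*Omega": [(0.081, 0.002), (0.436, 0.004), (0.483, 0.004)],
    "Omega":   [(0.063, 0.002), (0.443, 0.006), (0.494, 0.005)],
    "Omega/4": [(0.073, 0.015), (0.359, 0.011), (0.568, 0.019)],
}

def approx_interval(mean, std, k=2.0):
    """Approximate 95% interval: mean plus or minus k standard deviations."""
    return mean - k * std, mean + k * std

def all_covered(rows, obs):
    """True if every observed proportion lies inside its approximate interval."""
    return all(lo <= o <= hi
               for (lo, hi), o in zip((approx_interval(m, s) for m, s in rows), obs))

coverage = {name: all_covered(rows, observed)
            for name, rows in table5_white_females.items()}
print(coverage)
```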

We use the first four examples in Table 2 to illustrate
small area estimation. As can be imagined, it is too
cumbersome to present all the estimates for the 34 counties
and the 8 domains. Table 8 shows the posterior means,
standard deviations and 95% credible intervals for the $p_{ij}$
and the $\delta_i$.

First, we compare the estimates of the $p_{ij}$ from the
ignorable and nonignorable nonresponse models. The
estimates from the two models are generally different, with
the intervals for the nonignorable nonresponse model wider
than those for the ignorable nonresponse model.

Second, we compare the estimates (based on the
nonignorable nonresponse model) of the $p_{ij}$ for the individual
counties in Table 8 with the overall averages, the $\bar{p}_j$ in
Table 7. As expected, when the $\bar{p}_j$ are obtained, there is an
overall reduction in variability because of the extra
smoothing, thereby making the intervals for the smaller
domains relatively much wider. In fact, all the intervals for
the small domains contain the intervals for the $\bar{p}_j$.


Finally, in Table 8 we compare the estimates of the $\delta_i$ for
the individual counties with the overall average, $\bar{\delta}$, in
Table 6. The message is similar to that for the $p_{ij}$.
However, we note that the first example is an exception
where the credible interval for $\delta_i$, (.459, .773), is almost
completely to the left of the credible interval for
$\bar{\delta}$, (.735, .801). Thus, there is much shrinkage for this
example, which is due to the relatively large number of
nonrespondents, 14 in this county for white males 45-.


Table 8
Comparison of the ignorable (ig) and the nonignorable (nig)
nonresponse models for the four examples (Ex) corresponding
to small domains using the cell probabilities ($p_j$) and the
probability of responding ($\delta$)

Ex  Model        $p_1$             $p_2$             $p_3$             $\delta$
1   ig    avg    .444           .308           .248
          std    .073           .067           .067
          CI     (.297, .593)   (.193, .450)   (.125, .386)
    nig   avg    .450           .276           [illegible]    .637
          std    .093           .079           .082           .081
          CI     (.256, .638)   (.137, .444)   (.133, .448)   (.459, .773)
2   ig    avg    .480           .308           [illegible]
          std    .075           .066           .062
          CI     (.324, .619)   (.193, .452)   (.097, .344)
    nig   avg    .493           .263           .244           .879
          std    .074           .065           .062           .041
          CI     (.338, .628)   (.141, .406)   (.121, .394)   (.782, .948)
3   ig    avg    .420           .306           .274
          std    .071           .063           .063
          CI     (.276, .561)   (.192, .437)   (.161, .416)
    nig   avg    .438           [illegible]    .310           .741
          std    .079           .072           .074           .058
          CI     (.283, .591)   (.116, .406)   (.186, .483)   (.607, .836)
4   ig    avg    .448           .263           .288
          std    .089           .075           .081
          CI     (.278, .620)   (.127, .424)   (.138, .468)
    nig   avg    .430           .261           .308           .874
          std    .100           .086           .091           .046
          CI     (.217, .619)   (.104, .453)   (.145, .517)   (.768, .948)

Note: For each parameter avg = posterior mean; std = posterior
standard deviation; CI = 95% credible interval.


4.2 Linear and Nonlinear Logistic Regression Models


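Let $q_{ijl}$ denote the probability that a respondent in
the $l$th ($l = 1, \ldots, 8$) age-race-sex group in the $i$th
county belongs to the $j$th BMI level. (We add the subscript
$l$ to the $p_{ij}$ to denote the domains.) Letting

$$v_{ijl} = \log \left\{ \sum_{s=1}^{j} q_{isl} \Big/ \sum_{s=j+1}^{J} q_{isl} \right\}, \quad j = 1, \ldots, J - 1,$$

we take

$$v_{ijl} = \{\theta_j - (u_i + \alpha_l)\} / \psi_l, \quad i = 1, \ldots, c, \; j = 1, \ldots, J - 1, \; l = 1, \ldots, 8, \qquad (10)$$

subject to identifiability constraints that include
$\sum_{i=1}^{c} u_i = 0$, $\sum_{l=1}^{8} \alpha_l = 0$ and $\sum_{l=1}^{8} \ln \psi_l = 0$.
The parameters $\theta_j$, $u_i$, $\alpha_l$ and $\psi_l$ in (10) have
posterior distributions that are inherited from the posterior
distributions of the $q_{ijl}$. Each iterate of the MH algorithm
provides a value for $q_{ijl}$, which is used in (10), and a
nonlinear least squares problem is solved using an iterative
method to get the values of $\theta_j$, $u_i$, $\alpha_l$ and $\psi_l$
(see Appendix 2). Alternatively, we can also use the much
simpler linear logistic model in which the $\psi_l$ in (10) are
taken equal to unity. In this case, the least squares
estimators of $\theta_j$, $u_i$ and $\alpha_l$ exist in closed form
at the $h$th iteration of the MH algorithm. Specifically, we
have the least squares estimates
$\hat{u}_i = \bar{v}_{\cdot\cdot\cdot} - \bar{v}_{i\cdot\cdot}$,
$\hat{\theta}_j = \bar{v}_{\cdot j \cdot}$ and
$\hat{\alpha}_l = \bar{v}_{\cdot\cdot\cdot} - \bar{v}_{\cdot\cdot l}$, where

$$\bar{v}_{\cdot\cdot\cdot} = \sum_{i=1}^{c} \sum_{j=1}^{J-1} \sum_{l=1}^{8} v_{ijl} \Big/ \{8c(J - 1)\}, \qquad
\bar{v}_{i\cdot\cdot} = \sum_{j=1}^{J-1} \sum_{l=1}^{8} v_{ijl} \Big/ \{8(J - 1)\},$$

$$\bar{v}_{\cdot j \cdot} = \sum_{i=1}^{c} \sum_{l=1}^{8} v_{ijl} \Big/ (8c), \qquad
\bar{v}_{\cdot\cdot l} = \sum_{i=1}^{c} \sum_{j=1}^{J-1} v_{ijl} \Big/ \{c(J - 1)\}.$$

The nonlinear least squares problem is solved using an
iterative method to get the values of $\hat{\theta}_j$, $\hat{u}_i$,
$\hat{\alpha}_l$ and $\hat{\psi}_l$.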
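For the linear logistic case, in which the scale parameters are set to unity, the least squares solution reduces to marginal means of the cumulative logits. The following is a minimal sketch under that reading of model (10), with sum-to-zero constraints on the county and domain effects; the function names and the synthetic parameter values are our assumptions, used only to show that the closed form recovers known effects.

```python
import numpy as np

def cumulative_logits(q):
    """q has shape (c, J, L) and sums to 1 over axis 1; returns v of shape (c, J-1, L)."""
    cum = np.cumsum(q, axis=1)[:, :-1, :]
    return np.log(cum / (1.0 - cum))

def linear_least_squares(v):
    """Closed-form fit of v[i, j, l] = theta[j] - (u[i] + alpha[l]),
    assuming sum(u) = 0 and sum(alpha) = 0."""
    grand = v.mean()
    theta = v.mean(axis=(0, 2))        # cut-points: means over counties and domains
    u = grand - v.mean(axis=(1, 2))    # county effects
    alpha = grand - v.mean(axis=(0, 1))  # domain effects
    return theta, u, alpha

# Synthetic check with hypothetical parameter values (not estimates from the paper).
rng = np.random.default_rng(0)
c, J, L = 34, 3, 8
theta = np.array([-1.6, 0.1])
u = rng.normal(0.0, 0.2, c)
u -= u.mean()
alpha = rng.normal(0.0, 1.0, L)
alpha -= alpha.mean()
v = theta[None, :, None] - (u[:, None, None] + alpha[None, None, :])

theta_hat, u_hat, alpha_hat = linear_least_squares(v)
```

In an application, `linear_least_squares` would be applied to the cumulative logits computed from each MH iterate, so that the resulting values form a sample from the posterior distribution of the regression parameters.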

We present 95% credible intervals for $\theta_1$, $\theta_2$ and
$\alpha_1, \ldots, \alpha_8$ for the younger and older groups by regression
type in Table 9. For the cut-points $\theta_j$, $\theta_1$ gives a large
negative effect compared to $\theta_2$. The relative measure
$\alpha_l$ ($l = 1, \ldots, 4$) of the younger group gives a negative
effect, while the relative measure $\alpha_l$ ($l = 5, \ldots, 8$) of the
older group gives positive effects. The 95% credible
intervals for linear and nonlinear estimates are essentially
the same.

We also relate the probability of response,
$\delta_{il} = \sum_{j=1}^{J} \pi_{ijl} p_{ijl}$, to race and sex using linear and nonlinear
logistic regression models for the younger group. The 95%
credible intervals for $\theta$ and $\alpha_1, \ldots, \alpha_4$ for the younger group
by regression type are shown in Table 10. Credible intervals
for all $\alpha_l$ for the nonlinear model are shorter than those for
the linear model. However, for the nonlinear model the
credible interval for $\theta$ is wider than, and to the right of, that
for the linear model.


Table 9
Comparison of 95% credible intervals for $\theta_1$, $\theta_2$ and $\alpha_1, \ldots, \alpha_8$
for both younger and older groups by regression type

             Linear              Nonlinear
$\theta_1$   (-1.743, -1.469)    (-1.731, -1.466)
$\theta_2$   (0.028, 0.196)      (0.025, 0.193)
$\alpha_1$   (-1.167, -0.751)    (-1.159, -0.751)
$\alpha_2$   (-1.395, -0.939)    (-1.385, -0.937)
$\alpha_3$   (-1.127, -0.723)    (-1.119, -0.728)
$\alpha_4$   (-1.112, -0.659)    (-1.103, -0.658)
$\alpha_5$   (1.198, 1.514)      (1.188, 1.498)
$\alpha_6$   (0.513, 0.689)      (0.506, 0.685)
$\alpha_7$   (0.715, 1.210)      (0.725, 1.225)
$\alpha_8$   (0.809, 1.310)      (0.803, 1.300)


Table 10
Comparison of 95% credible intervals for $\theta$ and
$\alpha_1, \ldots, \alpha_4$ for the younger group by regression type

             Linear              Nonlinear
$\theta$     (1.455, 1.729)      (1.664, 2.174)
$\alpha_1$   (0.165, 0.592)      (0.146, 0.523)
$\alpha_2$   (-0.535, 0.014)     (-0.467, 0.007)
$\alpha_3$   (0.078, 0.546)      (0.079, 0.484)
$\alpha_4$   (-0.704, -0.165)    (-0.638, -0.169)


5. A SIMULATION STUDY


We describe a small simulation study to assess the
performance of our multinomial nonignorable nonresponse
model. We focus on the probability of responding.

We use the observed data from younger white males to
obtain the posterior means of $p_{i1}, p_{i2}, p_{i3}$ and
$\pi_{i1}, \pi_{i2}, \pi_{i3}$ for each county. These are taken to be the
true values, which we denote by $p_{i1}^{(0)}, p_{i2}^{(0)}, p_{i3}^{(0)}$ and
$\pi_{i1}^{(0)}, \pi_{i2}^{(0)}, \pi_{i3}^{(0)}$. Thus, the true probability of
responding in the $i$th county is
$\delta_i^{(0)} = \sum_{j=1}^{3} \pi_{ij}^{(0)} p_{ij}^{(0)}$ and the weighted probability of
responding is $\delta^{(0)} = \sum_{i=1}^{c} n_i \delta_i^{(0)} / \sum_{i=1}^{c} n_i$. In our
simulated examples, we used the $n_i$ as in the BMI data for
younger white males, and we kept the $p_{ij}^{(0)}$ fixed throughout.
However, we varied the $\pi_{ij}$ in the following manner. We kept
$\pi_{i1}$ fixed at $\pi_{i1}^{(0)}$, and we denote the vector of the $\pi_{i1}$ by
$\pi_1$. The 34 values of the $\pi_{i1}^{(0)}$ range from .73 to .83. Then,
we set $\pi_2 = a\pi_1$ and $\pi_3 = b\pi_1$, where $a, b = 0.8, 0.9, 1.0$.
(We denote the vectors of the $\pi_{i2}$ and the $\pi_{i3}$ by $\pi_2$ and
$\pi_3$ respectively.) Thus, there are 9 simulated examples.

Then, for each $(a, b)$ we generated counts from a multinomial
probability mass function with probabilities
$p_{i1}\pi_{i1}$, $p_{i2}\pi_{i2}$, $p_{i3}\pi_{i3}$, $p_{i1}(1 - \pi_{i1})$, $p_{i2}(1 - \pi_{i2})$,
$p_{i3}(1 - \pi_{i3})$. We denote these cell counts by $y_{i1}, y_{i2}, y_{i3}$,
$z_{i1}, z_{i2}, z_{i3}$, and the number of respondents is
$r_i = \sum_{j=1}^{3} y_{ij}$. Then, we fit the nonignorable nonresponse
model to the above data using the MH sampler, and we obtained
$M = 1,000$ values $(p_{ij}^{(h)}, \pi_{ij}^{(h)})$, $h = 1, \ldots, M$. For each
value, we computed
$\bar{\delta}^{(h)} = \sum_{i=1}^{c} n_i \delta_i^{(h)} / \sum_{i=1}^{c} n_i$, where
$\delta_i^{(h)} = \sum_{j=1}^{3} p_{ij}^{(h)} \pi_{ij}^{(h)}$.
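The data-generation step of the simulation can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the sample sizes and cell probabilities below are hypothetical placeholders (the study used the NHANES III values for younger white males), and only the range .73 to .83 for the first response probability is taken from the text.

```python
import numpy as np

def simulate_counts(n, p, pi, rng):
    """Draw respondent counts y and nonrespondent counts z for each county.
    n: (c,) sample sizes; p: (c, J) cell probabilities; pi: (c, J) response probabilities."""
    c, J = p.shape
    y = np.zeros((c, J), dtype=int)
    z = np.zeros((c, J), dtype=int)
    for i in range(c):
        # 2J-cell multinomial: respond in cell j, then not respond in cell j.
        probs = np.concatenate([p[i] * pi[i], p[i] * (1.0 - pi[i])])
        counts = rng.multinomial(n[i], probs)
        y[i], z[i] = counts[:J], counts[J:]
    return y, z

def weighted_response_prob(n, p, pi):
    """Weighted probability of responding: sum_i n_i delta_i / sum_i n_i."""
    delta_i = (p * pi).sum(axis=1)
    return float((n * delta_i).sum() / n.sum())

rng = np.random.default_rng(1)
c = 34
n = rng.integers(20, 60, size=c)               # hypothetical county sample sizes
p0 = rng.dirichlet([4.0, 2.0, 3.5], size=c)    # hypothetical "true" cell probabilities
pi1 = rng.uniform(0.73, 0.83, size=c)          # range quoted in the text
a, b = 0.8, 0.9                                # one of the nine (a, b) settings
pi0 = np.column_stack([pi1, a * pi1, b * pi1])

y, z = simulate_counts(n, p0, pi0, rng)
r = y.sum(axis=1)                              # number of respondents per county
delta_true = weighted_response_prob(n, p0, pi0)
```

In the study itself only the respondent counts and the total number of nonrespondents would be treated as observed; the nonrespondent cell counts are latent when the model is fitted.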

In Table 11 we report posterior means, standard deviations,
numerical standard errors (using the batch means
method) and 95% credible intervals for the probability of
responding for each choice of $(a, b)$. We also computed
$\Pr(\bar{\delta} < \delta^{(0)} \mid y, r)$ as the proportion of the $\bar{\delta}^{(h)}$ that are
smaller than $\delta^{(0)}$. An extremely large or small value of this
latter quantity suggests model failure.
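The two summaries just described can be sketched as below. The number of batches (25) and the simulated draws are our assumptions for illustration; the text does not state the batch size used.

```python
import numpy as np

def batch_means_nse(draws, n_batches=25):
    """Numerical standard error of the sample mean by the batch means method."""
    draws = np.asarray(draws, dtype=float)
    batch_size = draws.size // n_batches
    trimmed = draws[: batch_size * n_batches]
    means = trimmed.reshape(n_batches, batch_size).mean(axis=1)
    # Standard deviation of the batch means, scaled to the overall mean.
    return float(means.std(ddof=1) / np.sqrt(n_batches))

def prob_below(draws, true_value):
    """Monte Carlo estimate of Pr(quantity < true_value) from posterior draws."""
    return float(np.mean(np.asarray(draws) < true_value))

# Hypothetical stand-in for M = 1,000 posterior draws of the weighted response probability.
rng = np.random.default_rng(0)
draws = rng.normal(0.74, 0.016, size=1000)
nse = batch_means_nse(draws)
prob = prob_below(draws, 0.735)
```

With serially correlated output, the batch means estimate will generally exceed the naive value of the standard deviation divided by the square root of M, which is why it is the more honest measure of Monte Carlo error.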

We plotted the estimates of the posterior densities of $\bar{\delta}$
by choices of $a$ and $b$, which we obtained by using a normal
kernel density estimator with an optimal window width
from an output analysis of the MH algorithm. The densities
are unimodal, peaked and almost symmetric. As $(a, b)$
increases from (0.8, 0.8) to (1.0, 1.0), the mode of the
posterior density increases.


Table 11
Characteristics of the probability of responding

                               $\pi_3$
$\pi_2$         stat   0.8 * $\pi_1$      0.9 * $\pi_1$      1.0 * $\pi_1$
0.8 * $\pi_1$   true   0.690            0.719            0.748
                avg    0.712            0.739            0.764
                std    0.016            0.015            0.014
                nse    0.0030           0.0031           0.0029
                CI     (0.678, 0.742)   (0.708, 0.767)   (0.734, 0.750)
                prob   0.082            0.095            [illegible]
0.9 * $\pi_1$   true   0.706            0.735            0.764
                avg    0.710            0.742            0.776
                std    0.017            0.016            0.014
                nse    0.0030           0.0031           0.0031
                CI     (0.673, 0.742)   (0.712, 0.769)   (0.745, 0.802)
                prob   0.377            0.303            0.210
1.0 * $\pi_1$   true   0.722            0.751            0.780
                avg    0.726            0.758            0.784
                std    0.017            0.015            0.015
                nse    0.0036           0.0036           0.0026
                CI     (0.693, 0.757)   (0.725, 0.784)   (0.750, 0.809)
                prob   0.399            0.318            0.380

Note: avg = posterior mean; std = standard deviation; nse =
numerical standard error; CI = 95% credible interval;
prob = $\Pr(\bar{\delta} < \delta^{(0)} \mid y, r)$; the 34 values of $\pi_1$ range from .73
to .83.


In Table 11 we show that all the credible intervals
contain the true values and the posterior means are close to
the true values, with the least discrepancy for the nearly
ignorable nonresponse cases. The standard deviations are very
similar across the nine simulated examples. Also, the
numerical standard errors (nse) are small and similar for all
nine simulated examples. The estimates of $\Pr(\bar{\delta} < \delta^{(0)} \mid y, r)$
range from 0.30 to 0.40, except for the most nonignorable
nonresponse cases in which $(a, b) = (.8, .8)$ and $(.8, .9)$.
Thus, the model does perform reasonably well.


6. CONCLUSION 


We have described a Bayesian methodology that can be
used to analyze multinomial data for small areas when there
is nonignorable nonresponse. A hierarchical model is used,
and we have shown that it performs reasonably well. In fact,
we have extended the method of Stasny (1991) in two
directions: (a) we have considered multinomial data with
more than two cells (rather than binomial data) and (b) we
have done a full Bayesian analysis. Both (a) and (b) have been
implemented for small areas.


The Markov chain Monte Carlo method permits an
assessment of the complex structure of the multinomial
nonresponse estimation. Our empirical analysis and
simulation study indicate good performance of the model for
these data. Thus, the method of ratio estimation currently
used in NHANES III may be replaced by our Bayesian
method, as the nonrespondents’ characteristics might differ
from those of the respondents. In fact, an application of our
model to the NHANES III data shows that in each county
there are substantial differences in the proportions of
individuals at the three BMI levels by age and sex. This can
be seen in Table 1 when the observed counts are summed
over the counties. But we have obtained inference
(including measures of precision) for each county by age,
race and sex.

Our methodology can be extended in three ways. First, 
it is feasible to use a model that incorporates an extent of 
nonignorability, rather than just the dichotomy of ignorable 
nonresponse and nonignorable nonresponse. Second, one 
can use other prior distributions (e.g., Dirichlet process 
prior) to model heterogeneity in the clustering of the areas 
rather than assuming homogeneity of the areas as we have 
done. Third, one can use a fourth stage in our model to
accommodate clustering within households as well as
clustering within areas (counties) in NHANES III. These
tasks are very difficult.


ACKNOWLEDGEMENT 


This work was done at the National Center For Health 
Statistics while Balgobin Nandram was the first ASA/ 
NCHS Research Fellow and Geunshik Han was on sabbat- 
ical leave from Hanshin University, Korea. 


APPENDIX 1 


Metropolis-Hastings Samplers 


For the ignorable nonresponse model, (p,,7,) and 
([1,,»T),) are independent a posteriori with 


JD (y, +0, -7 7417) 
( ? y,r)a ( ied) nA a a aa. A.l 
P(H,, T, | P(M, at aan ) 


and 
Dias o| Vk ODE ho 0) 
S |B rth Hp 47+ C1 ~ Wy) Tt) (A.2) 
i=1 B( My M,C ~ Hy} ) Ty) 


where p(H,,T,) and p(p,,,T,,) are the prior distribu- 
tions. Samples can be obtained from each of (A.1) and 
(A.2) using the MH algorithm of Nandram (1998). 

For the nonignorable nonresponse model, it is conve- 
nient to condition on z to obtain 


e. | D(yy+%+ 5%)

where $p(\mu_3, \tau_3)$ and $p(\mu_{4j}, \tau_{4j})$, $j = 1, \ldots, J$, are the prior
distributions. Given z, (A.3) and (A.4) are independent with

We ran the MH sampler by drawing a random deviate from
each of (A.3), (A.4), and (A.5). It is easy to draw a random
deviate from (A.5). Samples were obtained from each of
(A.3), (A.4) and (A.5) using the MH algorithm of Nandram
(1998).


APPENDIX 2 


Nonlinear least squares estimates 
Let 


i I] 
a) qin | Lae a beta dter 
s=1 s=l 


These $v_{ijl}$ are obtained for each iterate from the
Metropolis-Hastings sampler. To solve the nonlinear least
squares problem we minimized


Coe eS ‘ 2 
es apaeed Man (9, - (u, + 4))] 
i=1 j=1 T=1 ; 
subject to the constraints ae 1M, = 0, yD, 0, = 0, ES 
: be c a 
a, = 0, and letting e=w, ,)j-; In wy, = 0. 
Taking partial derivatives to find the least squares 
estimate, we have 


(A.1) 


=logw,;' (A.2) 


where

With these settings we draw the $q_{ijl}$ from an MH algorithm,
and the nonlinear least squares problem is solved using an
iterative method to get values of $\theta_j$, $u_i$, $\alpha_l$ and $\psi_l$. Let


j 
(h) 
Liege af ? 
s=l 


J 

(h) _ (h) 

Vij = log Ss isi | 
s=1 


where $q_{ijl}^{(h)}$ denotes the value of $q_{ijl}$ at the $h$th iterate of the
MH algorithm. Then we minimize (A.1) subject to the above
constraints at the $h$th iterate to obtain $\theta_j^{(h)}$, $u_i^{(h)}$, $\alpha_l^{(h)}$ and
$\psi_l^{(h)}$. These iterates provide an estimate of the posterior
distributions of $\theta_j$, $u_i$, $\alpha_l$ and $\psi_l$. Convergence occurred for
our application in fewer than 10 iterations.

Assessing the Bias Associated with Alternative Contact Strategies in Telephone Time-Use Surveys


JAY STEWART


ABSTRACT 


In most telephone time-use surveys, respondents are called on one day and asked to report on their activities during the 
previous day. Given that most respondents are not available on their initial calling day, this feature of telephone time-use 
surveys introduces the possibility that the probability of interviewing the respondent about a given reference day is 
correlated with the activities on that reference day. Furthermore, noncontact bias is a more important consideration for 
time-use surveys than for other surveys, because time-use surveys cannot accept proxy responses. Therefore, it is essential 
that telephone time-use surveys have a strategy for making subsequent attempts to contact respondents. A contact strategy 
specifies the contact schedule and the field period. Previous literature has identified two schedules for making subsequent 
attempts: a convenient-day schedule and a designated-day schedule. Most of these articles recommend the designated-day 
schedule, but there is little evidence to support this viewpoint. In this paper, we use computer simulations to examine the 
bias associated with the convenient-day schedule and three variations of the designated-day schedule. The results support 
using a designated-day schedule, and validate the recommendations of the previous literature. The convenient-day schedule 
introduces systematic bias: time spent in activities done away from home tends to be overestimated. More importantly, 
estimates generated using the convenient-day schedule are sensitive to the variance of the contact probability. In contrast,
a designated-day-with-postponement schedule generates very little bias, and is robust to a wide range of assumptions about
the pattern of activities across days of the week.


KEY WORDS: Telephone time-use surveys; Contact strategies; Bias; Computer simulations. 


1. INTRODUCTION 


Telephone time-use surveys present a unique data 
collection challenge because respondents are called on one 
day and asked to report on their activities during the 
previous day. The challenge arises because most 
respondents — about 75% (Kalton 1985) — are not contacted 
on their original calling day, necessitating additional 
contact attempts. In most surveys, it does not matter when 
these additional attempts are made, because respondents are 
being asked to report about a fixed reference period. And in 
most surveys recall does not suffer too much if respondents 
are contacted several days after the initial calling day. But 
in time-use surveys, respondents’ ability to recall their 
activities on a given day falls off dramatically after a day or 
so, which means that the respondent must be assigned a 
new reference day if no contact is made on the initial 
calling day. As we will see below, this scenario introduces 
the possibility that the probability of interviewing the 
respondent about a given reference day is correlated with 
the activities on that reference day. Therefore it is essential 
that these surveys have a strategy for making subsequent 
attempts to contact respondents that does not introduce bias. 


Contact Strategies 


A contact strategy comprises a contact schedule and
a field period. The contact schedule specifies the days of
the week on which contact attempts will be made, and the
field period specifies the maximum number of weeks over
which attempts will be made.




Contact schedules fall into two main categories: 
designated-day schedules and convenient-day schedules. 
Both types of schedule randomly assign each respondent to 
an initial calling day. If the respondent is contacted on the 
initial calling day, the interviewer attempts to collect infor- 
mation about the reference day, which is the day before the 
calling day. It is for subsequent contact attempts that these 
schedules differ. 

Under a designated-day schedule, there are two 
approaches to making subsequent contact attempts. The 
interviewer could call the respondent on a later date, and 
ask the respondent to report activities for the original 
reference day. This approach maintains the original 
reference day, but extends the recall period. Harvey (1993) 
recommends allowing a recall period of no more than two 
days. The second approach is to postpone the interview and 
assign the respondent to a new reference day. Kalton (1985) 
recommends postponing the interview by exactly one week, 
so that the new reference day is the same day of the week as 
the original reference day. 

These approaches are not mutually exclusive. For
example, Statistics Canada’s designated-day schedule
allows interviewers to call respondents up to two days after
the reference day (Statistics Canada 1999), and to postpone
the interview by one week if the respondent cannot be
reached after the second day of attempts. The interview can
be postponed no more than three times (Statistics Canada 1999).
To illustrate, if the initial reference day is Monday the 1st,
the respondent is called on Tuesday the 2nd and, if
necessary, on Wednesday the 3rd. If no interview is obtained
on either of these days, the respondent is called on Tuesday
the 9th and, if necessary, on Wednesday the 10th, and asked
to report on activities done on Monday the 8th. This process
continues until the respondent is interviewed, refuses, or
until four weeks pass.
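The schedule in this illustration can be written out mechanically. The sketch below is only our reading of the description above (two calling days after each reference day, and up to three one-week postponements); the function name and parameters are hypothetical, and 1 January 2001 is used simply because it fell on a Monday.

```python
from datetime import date, timedelta

def designated_day_schedule(first_reference_day, max_postponements=3, recall_days=2):
    """Return a list of (reference_day, [calling_days]) pairs, one per week."""
    schedule = []
    for week in range(max_postponements + 1):
        reference = first_reference_day + timedelta(weeks=week)
        calls = [reference + timedelta(days=d) for d in range(1, recall_days + 1)]
        schedule.append((reference, calls))
    return schedule

schedule = designated_day_schedule(date(2001, 1, 1))
for reference, calls in schedule:
    print(reference.strftime("%A %d"), "->", [c.strftime("%A %d") for c in calls])
```

Every reference day in the resulting schedule falls on the same day of the week, which is the property that distinguishes postponement from day-of-week substitution.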

The convenient-day schedule does not maintain the 
designated reference day. If no contact is made, the 
interviewer calls on the next day and each subsequent day 
until the respondent is contacted. Once contact is made, the 
interviewer attempts to complete the interview or, if the 
respondent is unwilling to complete the interview at that 
time, reschedule it to a day that is convenient for the 
respondent. The reference day is always the day prior to the 
interview. It is worth noting that because respondents are 
not likely to schedule interviews on busy days, allowing 
them to choose their interview day is really no different 
than the interviewer proposing consecutive days (or calling 
on consecutive days) until the respondent accepts. Hence, 
one may think of the convenient-day schedule as being 
functionally identical to an every-day contact attempt 
schedule. 

A variant of the convenient-day schedule described 
above was used in the 1992-1994 Environmental Protection 
Agency (EPA) Time Diary Study conducted by the 
University of Maryland (see Triplett 1995). Respondents 
were not assigned to an initial calling day. Instead, they 
were assigned to either the weekday or the weekend 
sample. For example, those who were assigned to the 
weekend sample could be called on Sunday (to report about 
Saturday) or Monday (to report about Sunday). Interviewers 
were instructed to make at least 20 call attempts before 
finalizing the case as noncompleted. 

Most methodological papers argue in favor of using a 
designated-day schedule (Kinsley and O’Donnell 1983; 
Kalton 1985; Lyberg 1989; Harvey 1993; and Harvey 
1999). For example, Lyberg (1989) argues that the 
convenient-day schedule may introduce bias because “the 
respondent may choose a day when he/she is not busy, a 
day he/she is not engaged in socially unacceptable behavior, 
a day he/she thinks is representative, etc.” Kinsley and 
O’Donnell (1983) argue that the convenient-day schedule 
could exaggerate the number of events taking place outside 
the home, because the respondent is more likely to be inter- 
viewed on a day that immediately follows a day that he or 
she was out of the house. 
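The mechanism behind this argument can be illustrated with a deliberately simple simulation. It is a sketch under strong assumptions of our own, not the simulation design used later in this paper: each respondent is away from home on any given day independently with probability `p_out`, a contact succeeds only if the respondent is home on the calling day, and the interview always covers the previous day.

```python
import numpy as np

def simulate(p_out=0.3, n_resp=100_000, horizon=60, seed=0):
    """Estimated share of reference days spent away from home under two schedules."""
    rng = np.random.default_rng(seed)
    out = rng.random((n_resp, horizon)) < p_out   # True = away from home that day
    rows = np.arange(n_resp)
    start = 1                                     # initial calling day; day 0 is its reference day

    # Convenient-day schedule: call every day from `start` until the respondent is home.
    home = ~out[:, start:]
    contacted = home.any(axis=1)
    conv_ref = start + home.argmax(axis=1) - 1    # day before the first successful call
    conv_est = out[rows, conv_ref][contacted].mean()

    # Designated-day schedule with postponement: call only once a week.
    weekly = np.arange(start, horizon, 7)
    home_w = ~out[:, weekly]
    contacted_w = home_w.any(axis=1)
    des_ref = weekly[home_w.argmax(axis=1)] - 1   # same day of week as the original reference day
    des_est = out[rows, des_ref][contacted_w].mean()
    return float(conv_est), float(des_est)

conv_est, des_est = simulate()
print(conv_est, des_est)
```

Under these assumptions the convenient-day estimate is close to p_out(2 - p_out) rather than p_out, because every unsuccessful call guarantees that the eventual reference day was spent away from home, whereas the weekly schedule leaves the reference day unrelated to the failed calls.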

Two of these studies directly compare the designated-
day and convenient-day schedules (Kinsley and O’Donnell
1983; Lyberg 1989). In Kinsley and O’Donnell (1983), the
experimental design divided the sample into two groups.
They found that the two schedules produced similar
response rates, and that the demographic composition was
similar for both samples. They also found that the estimated
time spent away from home was much higher under the
convenient-day schedule than under the designated-day
schedule. But it is impossible to determine whether the
convenient-day schedule overestimates time spent away
from home or if the designated-day schedule underestimates
time spent away from home, because the truth is not known.
In Lyberg (1989), two diaries were collected from each
respondent. One was collected using a designated-day
schedule and the other was collected using a
convenient-day schedule. However, the convenient-day
diaries were conducted by an interviewer, while the
designated-day diaries were self-administered several days
after the convenient-day interview. So it is impossible to
determine whether any differences were due to differences
in contact schedules or whether they were due to mode
effects.

Two studies (Lyberg 1989; Laaksonen and Pääkkönen
1992) investigate the effect of postponement on response
rates. Both studies found that postponement increases
response rates. Laaksonen and Pääkkönen (1992) also
found that it was difficult to evaluate whether postponement
introduces bias. Their results showed that respondents who
postponed their interview spent less time on housekeeping
and maintenance, and more time on shopping and errands.
However, it is unclear whether these differences are the
result of bias introduced by postponement, unobserved
heterogeneity that is correlated with the postponement
probability, or simply random noise. In any case, they
argued that the differences were small, so that any bias was
small.

One advantage of the convenient-day schedule is that it 
is possible to make many contact attempts in a short period 
of time. In contrast, the designated-day schedule — as 
proposed — permits only one contact attempt per week. So 
it is natural to ask: Would it be reasonable to modify the 
designated-day schedule to allow some form of day-of- 
week substitution? For example, if the respondent cannot be 
reached on Tuesday to report about Monday, would it be 
acceptable to contact the respondent on, say, Thursday and 
ask him or her to report about Wednesday? This modified 
schedule would allow for more contact attempts without 
having to extend the field period. 

Because this type of substitution makes sense only if the 
substitute days are fairly similar to the original days, the 
first step was to determine which days, if any, were similar 
to one another. In earlier work, Stewart (2000) showed that 
Monday through Thursday are very similar to each other, 
Fridays are slightly different from the other weekdays, and 
Saturday and Sunday are very different from the weekdays 
and from each other. Hence, it would be reasonable to allow 
day-of-week substitution at least for Monday through 
Thursday. 


Activity Bias and Noncontact Bias 


When selecting a contact strategy, we need to be 
concerned with two types of bias: activity bias and non- 
contact bias. Activity bias occurs when the probability of 
contacting and interviewing a potential respondent on a 
particular day is correlated with his or her activities on that 
day. Note that here and throughout the paper, the term
contact probability refers to the probability of a productive 
contact (one that results in an interview). In order to isolate 
the effects of using alternative contact strategies, it is 
assumed that respondents always agree to an interview 
when contacted. Noncontact bias occurs when differences 
in contact probabilities across individuals are caused by 
differences in activities across individuals. Two simple 
numerical examples will illustrate these biases. 


Example 1 — Activity Bias: Suppose that potential 
respondents’ days fall into two categories: hard-to-contact 
(HTC) days and easy-to-contact (ETC) days. Further 
suppose that interviewers never contact respondents on 
HTC days (i.e., that $P_H = 0$, where $P_H$ is the contact proba-
bility on an HTC day), and that they always contact
respondents on ETC days (i.e., that $P_E = 1$, where $P_E$ is the
contact probability on an ETC day). Finally, suppose that 
the probability that any day is an ETC day is 0.5, so that on 
average half of each potential respondent’s days are ETC 
and half are HTC. Note that all potential respondents are 
identical in the sense that the probability that any given day 
is an ETC day is 0.5 for all potential respondents. For 
simplicity, I assume that the activities of a given day can be 
summarized by an “activity index,” $I_J$, where $I_J =
1 - P_J$ ($J = H, E$). The activity index represents time spent
in activities that are negatively correlated with the contact 
probability. Thus, HTC days are days in which more time 
is spent in activities that are done away from home 
(working, shopping, active leisure, etc.), while ETC days 
are days in which more time is spent in activities that are 
done at home (housework, passive leisure, etc.). The true 
average activity index for the population of potential 
respondents is 0.5 (= 0.5 × 1 + 0.5 × 0).

If a convenient-day contact schedule is used and there is 
no limit on the number of call-backs, then HTC days are 
oversampled. To see why this occurs, it is instructive to 
work through the two possible contact sequences. If the 
initial contact attempt occurs on an ETC day, then the 
respondent is contacted and asked about the previous day 
(the diary day). Because HTC and ETC days are equally 
likely, on average half of these diary days will be HTC and 
the other half will be ETC. Therefore, the average activity 
index for the diary days of these respondents is equal to 0.5, 
which is the same as the population average. If, on the other 
hand, the initial contact day is an HTC day, then no inter- 
view takes place and the respondent is called on the 
following day. Contact attempts continue every day until 
the respondent is reached (on an ETC day). The average 
activity index for the diary days of these respondents is 
equal to one, because the respondent is always interviewed 
on an ETC day that immediately follows an HTC day. So if 
a given day is HTC (i.e., the respondent does a lot of 
activities away from home), then it is more likely that that 
day will be selected as the reference day. Hence, the 
probability of interviewing the respondent on a given 
reference day is correlated with the activities on that 
reference day. Since half of the initial contact attempts are
made on HTC days and half are made on ETC days, the
average activity index for the final sample is equal to 0.75
(= 0.5 × 0.5 + 0.5 × 1).
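The arithmetic of Example 1 can also be checked by simulation. The following is a minimal sketch, not the program used for the paper: the function name, the seed, and the number of simulated respondents are illustrative assumptions. Each day is HTC (activity index 1) or ETC (activity index 0) with equal probability, contact succeeds only on ETC days, and the diary day is the day before the contact.

```python
import random


def simulate_convenient_day(n_respondents, p_etc=0.5, seed=0):
    """Average activity index of diary days with unlimited daily call-backs."""
    rng = random.Random(seed)
    total_index = 0
    for _ in range(n_respondents):
        # Day before the first attempt: HTC (index 1) or ETC (index 0).
        previous_is_htc = rng.random() >= p_etc
        while True:
            today_is_htc = rng.random() >= p_etc
            if not today_is_htc:
                break  # contact succeeds on an ETC day
            # No contact today; today becomes the candidate diary day.
            previous_is_htc = True
        total_index += 1 if previous_is_htc else 0
    return total_index / n_respondents


estimate = simulate_convenient_day(200_000)
print(f"average activity index of diary days: {estimate:.3f}")
```

With a large number of simulated respondents the estimate should settle near 0.75, compared with a true average of 0.5.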


Example 2 — Noncontact Bias: Now suppose that potential 
respondents differ with respect to their contact probabilities, 
and that the contact probabilities for each individual do not 
vary from day to day. Suppose also that half of all potential 
respondents are HTC, with $P_H = 0.25$, and that the other
half are ETC, with $P_E = 0.75$. If we attempt to contact each
potential respondent four times, given these probabilities, 
virtually all (99.6%) ETC potential respondents are 
contacted. In contrast, only 68.4% of HTC potential 
respondents are contacted. The overall contact rate is 84% 
(99.6% × 0.50 + 68.4% × 0.50), but the final sample is not
representative: 59.3% of the sample are ETC and only
40.7% are HTC. Therefore, estimates based on this sample
will tend to underestimate the time spent in activities done 
by HTC people, and overestimate the time spent in 
activities done by ETC people. 
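The figures in Example 2 follow from the probability of at least one success in four independent attempts. A short sketch of the calculation, with illustrative variable names:

```python
def contact_rate(p, attempts):
    """Probability of at least one contact in `attempts` independent tries."""
    return 1 - (1 - p) ** attempts


p_htc, p_etc, attempts = 0.25, 0.75, 4
rate_htc = contact_rate(p_htc, attempts)
rate_etc = contact_rate(p_etc, attempts)

# Half of the potential respondents belong to each group.
overall = 0.5 * rate_htc + 0.5 * rate_etc
share_etc = 0.5 * rate_etc / overall
share_htc = 0.5 * rate_htc / overall

print(f"ETC contacted: {rate_etc:.1%}, HTC contacted: {rate_htc:.1%}")
print(f"overall: {overall:.1%}, sample shares: ETC {share_etc:.1%}, HTC {share_htc:.1%}")
```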


The biases described above are not limited to time-use 
surveys. Although most surveys take steps to minimize 
noncontact bias, less attention has been devoted to activity 
bias. For example, in addition to their main focus on 
collecting event history information on employment, the 
National Longitudinal Surveys also include a few questions 
about labor force activities (employment and hours) during 
the week prior to the interview. Because these interviews 
tend to be scheduled at the convenience of the respondent, 
the respondent's activities during the reference week will be 
correlated with the probability of interviewing the 
respondent about that reference week. The intuition behind 
this correlation is exactly the same as that in Example 1. 
This correlation introduces bias into hours-worked esti- 
mates, although the direction of the bias is indeterminate. 
Hours worked per week tend to be overestimated for 
respondents who were unable to schedule an interview 
because of a heavy work schedule, and tend to be under- 
estimated for respondents who were away on vacation. 
Activity bias is also an issue for travel surveys. Time spent 
away from home will tend to be overestimated if 
respondents are asked about, say, the four weeks prior to 
the interview. Asking respondents about a fixed reference 
period can eliminate this bias. 

It is worth noting that noncontact bias is a more 
important consideration for time-use surveys than for other 
surveys, because, unlike most other surveys, time-use 
surveys cannot accept proxy responses. If proxy responses 
could be accepted then data on HTC individuals could be 
collected from proxies, who may be easier to contact. This 
would weaken the correlation between the individual’s 
activities and the probability of collecting data about that 
individual. 

The rest of the paper is organized as follows. In section 
2, four contact strategies are introduced, and simple 
simulations are used to assess the activity bias associated
with each strategy. In section 3, the simulations are
augmented with data from the May 1997 Work Schedule
Supplement to the Current Population Survey and the 1992-
94 University of Maryland Time Diary Study, and the
variation in bias across specific activities is examined. In addition, the
overall bias is decomposed to assess the relative contri- 
bution of activity bias and noncontact bias. Section 4 
summarizes these results and makes recommendations. 


2. CONTACT STRATEGIES, CORRELATED 
ACTIVITIES, AND ACTIVITY BIAS 


In this section, the activity biases associated with the 
convenient-day schedule and each of the three variants of 
the designated-day schedule are compared. These schedules 
are defined as follows: 


1. Convenient day (CD): Attempt to contact potential 
respondents every day following the initial contact 
attempt until contact is made or until the field period 
ends. 


2. Designated day (DD): Attempt to contact potential 
respondents only once (no subsequent attempts). 


3. Designated day with postponement (DDP): Attempt to 
contact potential respondents on the same day of the 
week as the initial attempt until contact is made or 
until the field period ends (as recommended by Kalton 
1985). 

4. Designated day with postponement and substitution 
(DDPS): Attempt to contact potential respondents 
every other day following the initial contact attempt 
until contact is made or until the field period ends. 


The DDPS schedule assumes alternating Tuesday/ 
Thursday and Wednesday/Friday contact days. Whether the 
first week is Tuesday/Thursday or Wednesday/Friday 
depends on the start day, which is randomly assigned. 
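The four schedules can be written down as lists of permitted contact attempts. The sketch below is one possible reading rather than the program used for the paper. It assumes that contact days are Tuesday through Friday (each asking about the previous day), that weeks are numbered from 0, and that the start day determines whether the first DDPS week is Tuesday/Thursday or Wednesday/Friday. The function name and representation are hypothetical.

```python
CONTACT_DAYS = ["Tue", "Wed", "Thu", "Fri"]  # assumed: each asks about the previous day


def attempt_days(schedule, start_day, n_weeks):
    """List the (week, day) contact attempts a schedule allows, in order.

    Attempts stop at the first successful contact; that stopping rule is
    left to the caller.
    """
    start = CONTACT_DAYS.index(start_day)
    if schedule == "DD":      # one attempt only
        return [(0, start_day)]
    if schedule == "DDP":     # same day of the week, every week
        return [(week, start_day) for week in range(n_weeks)]
    if schedule == "CD":      # every contact day from the start day onward
        allowed = CONTACT_DAYS
    elif schedule == "DDPS":  # alternate Tue/Thu weeks with Wed/Fri weeks
        allowed = None
    else:
        raise ValueError(f"unknown schedule: {schedule}")

    pairs = [["Tue", "Thu"], ["Wed", "Fri"]]
    first_pair = 0 if start_day in pairs[0] else 1
    attempts = []
    for week in range(n_weeks):
        days = allowed or pairs[(first_pair + week) % 2]
        for day in days:
            # In the first week, skip days that precede the start day.
            if week > 0 or CONTACT_DAYS.index(day) >= start:
                attempts.append((week, day))
    return attempts


print(attempt_days("DDPS", "Tue", 2))
```

Under these assumptions a DDPS schedule that starts on a Tuesday allows two attempts per week, while DDP allows one.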

As seen in Example 1, it is straightforward to show that 
a convenient-day schedule can introduce activity bias into 
time-use estimates when the base contact probability is the 
same each day (0.5) except for random noise (+0.5 with
probability ½ or -0.5 with probability ½). Even though
Stewart (2000) shows that Monday through Thursday are 
very similar on average, it is likely that the contact probabi- 
lities for some individuals vary systematically by day each 
week. For example, some individuals may be hard to 
contact on Monday, Wednesday, and Friday of each week. 
This systematic variation makes it considerably more 
complicated to determine whether sample estimates are 
biased, and to determine the direction and extent of that 
bias. One could model contact strategies and analytically 
solve for the bias under different assumptions about the 
pattern of contact probabilities. However, this is a cumber- 
some process, because each assumption about the pattern of
contact probabilities across days would require a separate
solution. In contrast, computer simulations are an ideal way 
to assess the bias associated with alternative contact 
strategies under different assumptions about the pattern of 
contact probabilities. The computer program is simpler and 
produces more intuitive results than the analytical solution. 
And it is easy to modify the program to allow for different 
patterns. In section 3, realism is added to the simulations by 
incorporating real time-use data — something that would be 
impossible to do when taking an analytical approach. 


Simulations 


The simulation strategy was very straightforward. First,
four weeks’ worth of “data” were created for each of 10,000
potential respondents. In order to focus on contact
strategies, the sampling procedures are ignored and it is 
assumed that the sample of potential respondents is 
representative of the population. The simulations are 
designed to compare the four contact schedules above, so 
it is assumed that the “week” is five days long. Eligible 
diary days were restricted to Monday through Thursday, 
because, as noted above, these days are the most similar to 
each other. The next step was to simulate attempts to 
contact these respondents using the four contact schedules 
described above. Finally, the estimates generated using each 
schedule were compared to the true sample values. 

To simplify the simulations I abstracted from specific 
activities, as in the examples above, and characterized each 
day using an activity index, $I_J$ ($J = H, E$), that ranges from
0 to 1. The activity index is given by $I_J = 1 - P_J$, where $P_J$
is the probability of contacting and interviewing the
respondent. To simulate the variation in activities across
days, the contact probability on a given day is:


$$P_J = \bar{P}_J + \varepsilon$$


where $\bar{P}_J$ is the average contact probability on an HTC
($J = H$) or an ETC ($J = E$) day, and $\varepsilon \sim U(-\bar{\varepsilon}, \bar{\varepsilon})$. I assume
that $\bar{P}_H < \bar{P}_E$, which means that, on average, respondents
are less likely to be contacted on HTC days than on ETC
days. To ensure that contact probabilities lie in the [0,1]
interval, I set $\bar{\varepsilon}$ so that $\bar{\varepsilon} \le \min(\bar{P}_H, 1 - \bar{P}_E)$.
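As an illustration of this step, the sketch below draws one week of contact probabilities for a given pattern of HTC and ETC days and converts them to activity indexes. It is a minimal sketch under stated assumptions: the function name and arguments are hypothetical, the noise half-width is required to be no larger than the smaller of the average HTC probability and one minus the average ETC probability, and Python's standard `random` module stands in for whatever generator the original program used.

```python
import random


def daily_contact_probabilities(pattern, p_bar_h, p_bar_e, eps_bar, rng):
    """Draw a contact probability for each day in a pattern such as "HHEEE"."""
    if not 0 <= eps_bar <= min(p_bar_h, 1 - p_bar_e):
        raise ValueError("eps_bar would push a probability outside [0, 1]")
    base = {"H": p_bar_h, "E": p_bar_e}
    # Base probability for the day type plus uniform noise on [-eps_bar, eps_bar].
    return [base[day] + rng.uniform(-eps_bar, eps_bar) for day in pattern]


rng = random.Random(0)
probs = daily_contact_probabilities("HHEEE", 0.25, 0.75, 0.25, rng)
activity_index = [1 - p for p in probs]
print([round(p, 3) for p in probs])
print([round(i, 3) for i in activity_index])
```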

There are many assumptions one can make regarding the 
pattern of activities across days. The simplest case is where 
all days are identical except for random noise. But as noted 
above, it is possible that potential respondents are systema- 
tically harder to contact on some days than others. To cover 
a wide range of activity patterns, the simulations were 
performed under the following eight assumptions about the 
pattern of HTC and ETC days in each of the four weeks: 


1. Actual values of the activity index are distributed as 
U(0,1), so that the average value is 0.5. 

2. The first two days of every week are HTC and the last 
three days are ETC (HHEEE). 

3. The first three days of every week are HTC and the 
last two days are ETC (HHHEE). 
4. The first four days of every week are HTC and the last
day is ETC (HHHHE). 

5. The first day of every week is ETC and the last four 
are HTC (EHHHH). 


6. The first two days of every week are ETC and the last 
three are HTC (EEHHH). 


7. The first three days of every week are ETC and the last 
two are HTC (EEEHH). 


8. For half the sample Monday, Wednesday, and Friday 
are HTC and Tuesday and Thursday are ETC 
(HEHEH). For the other half of the sample the reverse 
is true (EHEHE). 


In pattern 1, the base probability of contacting the 
respondent is the same, so that all of the variation in 
probabilities is due to the random term. In patterns 2-7, 
HTC days are grouped together either at the beginning of 
the week or at the end of the week. And in pattern 8, the
base probabilities alternate between HTC and ETC days. 
To focus on activity bias, separate simulations were 
performed for each of the 8 patterns described above. Thus, 
within a simulation all individuals have the same pattern of 
base probabilities. 

Table 1 shows the results from a representative subset of 
the 153 simulations performed. The first four columns show 
the average contact probability on HTC and ETC days, the 
value of $\bar{\varepsilon}$, and the true average activity index. The
remaining columns contain estimates of the bias associated 
with the four contact schedules. The bias was computed as 
the difference between the estimated amount of time spent 
in each activity and the true amount of time spent in each 
activity, and then the difference was expressed as a 
percentage of the true value. Entries with an asterisk
indicate that the bias is statistically different from zero
at the 5% level.
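The bias measure described here is a simple relative difference. A minimal sketch, with a hypothetical function name, applied to the figures from Example 1:

```python
def percent_bias(estimate, truth):
    """Difference between estimate and truth, as a percentage of the truth."""
    return 100.0 * (estimate - truth) / truth


# Example 1: estimated average activity index 0.75 against a true value of 0.5.
print(percent_bias(0.75, 0.5))
```

A positive value means the schedule overstates the activity index, and a negative value means it understates it.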


Table 1
Activity Bias Associated with Each Contact Strategy Under Alternative Assumptions About the Correlation
of Activities Across Days

                  Average Contact Probability                                           Estimated Bias (Expressed as a percent of the true activity index)
Activity          Hard-to-contact  Easy-to-contact                      True Average
Pattern           days             days             $\bar{\varepsilon}$  Activity Index  CD           DD           DDP          DDPS

Identical Base Probabilities
                  0.50             0.50             0.10                0.500           [illegible]  -0.1         0.0          0.1
                  0.50             0.50             0.30                0.500           [illegible]  [illegible]  0.1          0.2
                  0.50             0.50             0.50                0.500           [illegible]  -0.9         0.4          0.7
Grouped Base Probabilities
HHEEE             0.75             0.25             0.05                0.500           [illegible]  [illegible]  -4.7*        -13.8*
                  0.75             0.25             0.25                0.500           [illegible]  [illegible]  -4.8*        -13.9*
                  0.60             0.40             0.05                0.500           -0.1         -2.2*        [illegible]  -2.8*
                  0.60             0.40             0.20                0.500           [illegible]  -2.6*        [illegible]  [illegible]
HHHEE             0.75             0.25             0.05                0.625           -2.7*        -9.7*        [illegible]  [illegible]
                  0.75             0.25             0.25                0.625           [illegible]  [illegible]  [illegible]  [illegible]
                  0.60             0.40             0.05                0.550           -0.4*        -1.8*        [illegible]  [illegible]
                  0.60             0.40             0.20                0.550           1.9*         -2.4*        -0.5         [illegible]
HHHHE             0.75             0.25             0.05                0.750           0.1          -0.1         0.1          0.0
                  0.75             0.25             0.25                0.750           [illegible]  -0.5         0.2          0.2
                  0.60             0.40             0.05                0.600           0.1*         0.0          0.0          0.0
                  0.60             0.40             0.20                0.600           [illegible]  -0.3         0.2          0.2
EHHHH             0.75             0.25             0.05                0.625           [illegible]  1.0          1.4*         0.7
                  0.75             0.25             0.25                0.625           4.2*         -0.3         [illegible]  0.7
                  0.60             0.40             0.05                0.550           [illegible]  0.3          [illegible]  0.3
                  0.60             0.40             0.20                0.550           [illegible]  0.0          0.6*         0.4
EEHHH             0.75             0.25             0.05                0.500           [illegible]  [illegible]  [illegible]  [illegible]
                  0.75             0.25             0.25                0.500           -15.9*       -17.9*       -4.5*        -20.9*
                  0.60             0.40             0.05                0.500           -2.0*        -2.2*        -0.4         -2.6*
                  0.60             0.40             0.20                0.500           -0.4         -2.4*        -0.3         -2.6*
EEEHH             0.75             0.25             0.05                0.375           -16.6*       -17.6*       [illegible]  [illegible]
                  0.75             0.25             0.25                0.375           -11.4*       -17.6*       [illegible]  [illegible]
                  0.60             0.40             0.05                0.450           -2.0*        -2.3*        -0.4         -2.5*
                  0.60             0.40             0.20                0.450           0.0          [illegible]  -0.5         [illegible]
Alternating Base Probabilities
HEHEH/EHEHE       0.75             0.25             0.05                0.500           [illegible]  26.4*        [illegible]  [illegible]
                  0.75             0.25             0.25                0.500           [illegible]  [illegible]  [illegible]  [illegible]
                  0.60             0.40             0.05                0.500           5.6*         4.5*         [illegible]  [illegible]
                  0.60             0.40             0.20                0.500           [illegible]  4.3*         [illegible]  [illegible]


Note: Asterisks indicate that the estimated average activity index is statistically different from the true value at the 5% level. 
Pattern 1 — Identical Base Probabilities with Random
Noise 


This pattern is essentially the same as in the numerical 
example above. The main result is that all of the contact 
schedules generate unbiased estimates for the average 
activity index, except the CD schedule. As expected, the 
CD schedule overestimates the average activity index. 
More importantly, when using the CD schedule, the 
estimated average activity index — and hence the bias when 
activities are uncorrelated across days — is positively 
correlated with the variance of $\varepsilon$. As the variance increases
from 0.003 ($\bar{\varepsilon} = 0.1$) to 0.083 ($\bar{\varepsilon} = 0.5$), the bias increases
from less than 1% to 15%. One can see the intuition behind
this result by noting that a large negative realization of $\varepsilon$ on
a particular day makes it less likely that the respondent will
be contacted on that day, and hence, more likely that that
day will become the diary day. None of the other contact
schedules is sensitive to the variance of $\varepsilon$.
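The two variances quoted above follow from the variance of a uniform random variable. Writing $\bar{\varepsilon}$ for the half-width of the noise interval, so that the noise term is uniform on $[-\bar{\varepsilon}, \bar{\varepsilon}]$, the calculation is:

```latex
\operatorname{Var}(\varepsilon)
  = \frac{\bigl(\bar{\varepsilon} - (-\bar{\varepsilon})\bigr)^{2}}{12}
  = \frac{\bar{\varepsilon}^{2}}{3},
\qquad
\frac{0.1^{2}}{3} \approx 0.003,
\qquad
\frac{0.5^{2}}{3} \approx 0.083 .
```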


Patterns 2-7 - Grouped Base Probabilities 


The results are mixed when HTC days are grouped at 
either the beginning or the end of the week. In the 
simulations where $\bar{P}_E - \bar{P}_H$ is relatively small (0.2), all of
the contact schedules perform reasonably well. The
absolute value of the bias is less than 3% in all cases.
However, when $\bar{P}_E - \bar{P}_H$ is relatively large (0.5), there are
significant differences in the bias associated with each
contact schedule. The DDP schedule performs the best
overall. The bias exceeds 5% (in absolute value) only in
pattern 7 (EEEHH), for which the bias is -5.5%. In
contrast, when using the DD and DDPS schedules, the bias
is in the 10-14% range in patterns 2 (HHEEE) and 3
(HHHEE), and in the 16-20% range in patterns 6
(EEHHH) and 7 (EEEHH). The differences between the
DD and DDPS schedules and the DDP schedule for these
patterns are significant, both statistically and in practical
terms. In patterns 4 (HHHHE) and 5 (EHHHH) the DDP
schedule performs slightly worse than the DD and DDPS
schedules, but the bias is so small (less than 1.5%) that the
difference is of no practical significance. The CD schedule
fares somewhat better than the DD and DDPS schedules.
The bias is less than 5%, except in patterns 6 and 7 where
the bias is in the 11-18% range. As in pattern 1 above, the
estimated average activity index increases with the variance
of $\varepsilon$ under the CD schedule, but not under any of the other
schedules. And as can be seen from Table 1, in patterns
where the bias is negative (patterns 6 and 7), an increase in
the variance of $\varepsilon$ reduces the absolute value of the bias.


Pattern 8 — Alternating Base Probabilities 


All of the contact schedules generate biased estimates,
because ETC days are undersampled. As above, all of the
schedules perform reasonably well when $\bar{P}_E - \bar{P}_H$ is
relatively small. The bias is in the 5-8% range for all
schedules except DDP, for which the bias is about 1%.
However, when $\bar{P}_E - \bar{P}_H$ is large, all of the contact
schedules generate significant bias. The bias of about 10%
for the DDP schedule is higher than for the other patterns,
but it is smaller than the 25-35% bias for the other
schedules. Again, these differences are significant both
statistically and in practical terms.

The reason that the DDPS schedule generates a large 
activity bias is that contact attempts are made on two HTC 
days and then on two ETC days (or the reverse). This 
pattern results in contacting respondents on a relatively 
large fraction of ETC days, and hence, diary days will be 
disproportionately HTC days. Not surprisingly, if the DDPS 
schedule is modified so the respondent is contacted on the 
same two days each week, there is virtually no bias. 


It is clear from these simulations that the activity bias 
associated with each contact schedule depends on the 
pattern of activities across days, the contact probabilities on 
HTC and ETC days, and the variance of those probabilities. 
However, it is also clear that the DDP schedule outperforms 
the other schedules regardless of the pattern assumed. If 
each pattern is viewed as a different type of respondent, 
then the overall bias (which includes both activity and 
noncontact bias) depends on the relative frequency of each 
type in the population. Information on the incidence of each 
type would allow one to measure the overall bias, and, for 
each strategy, decompose the overall bias into the portion
due to activity bias, and the portion due to noncontact bias. 
This is investigated in the next section. 


3. AUGMENTED SIMULATIONS 


If one is willing to make some additional assumptions, it 
is possible to augment the simulations using data from other 
sources. The first assumption is that individuals’ work 
schedules are a reasonable proxy for the patterns of HTC 
and ETC days, so that work days correspond to HTC days 
and nonwork days correspond to ETC days. The second 
assumption is that it is possible to replicate an individual’s
week by taking one day from each of five individuals. 

Data from the May 1997 Work Schedule Supplement to 
the Current Population Survey (CPS) were used to obtain 
information about individuals’ work schedules. Note that 
because of the need to know the prevalence of each type of 
schedule for the entire population, nonworkers were also 
included. Table 2 shows the patterns of work (W) days and 
nonwork (N) days from the May 1997 CPS. Approximately 
88% of all individuals fall into two patterns. Forty-eight 
percent work all five weekdays, and 39% do not work any 
weekdays. Another 4% work four weekdays and have either 
Friday or Monday off. The remaining individuals do not 
exhibit any discernible pattern. To simplify the simulations, 
it was assumed that individuals either worked all 5 
weekdays (workers) or that they did not work any weekdays 
(nonworkers). 
Table 2
Distribution of Work Schedules

Activity Pattern
M    Tu   W    Th   F      Percent   Cumulative Percent
-    -    -    -    -        39.40     39.40
W    W    W    W    W        48.11     87.51
W    W    W    W    -         2.63     90.14
-    W    W    W    W         1.63     91.77
W    W    W    -    -         0.81     92.58
W    W    -    -    -         0.26     92.84
-    -    -    W    W         0.37     93.21
-    -    W    W    W         0.68     93.89
W    -    W    -    W         0.49     94.38
-    W    -    W    -         0.25     94.63
-    -    -    -    W         0.51     95.14
W    -    -    -    -         0.25     95.39
W    W    -    W    W         0.73     96.12
[illegible]                   0.36     96.48
W    -    -    W    W         0.70     97.18
Other patterns                2.82    100.00
Total                       100.00


Note: A “W” indicates a workday, and a “-” indicates a nonwork day. 
Author’s tabulations from the May 1997 Work Schedule Supplement 
to the CPS. Observations were weighted using supplement weights. 
The sample size is 89,746 observations. 


To generate information on individual activities, data
from the 1992-94 EPA Time Diary Study, conducted by the
University of Maryland, were used. This dataset contains
time diaries for a sample of 7,408 adults (see Triplett 1995).
Because each individual was interviewed only once, there
is only one observation per person. The following repeated
sampling method was used to construct 8 weeks’ worth of
data for a sample of 18,974 “individuals.” The diary data
were divided into workdays and nonwork days. A diary day 
was considered a workday if the individual did any paid 
work during the day. Workdays were assigned to workers 
and nonwork days were assigned to nonworkers. Mondays 
were drawn from Monday observations, Tuesdays were 
drawn from Tuesday observations, etc. No observation was 
used more than once for a given individual, but the same 
observation could be used for more than one individual. 
The final sample proportions look fairly similar to the 
proportions from the CPS. Fifty-eight percent of individuals 
in the final sample were workers and 42% were non- 
workers, which is reasonably close to the ratio of workers 
to nonworkers (1.38 vs. 1.23) in the CPS. 
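The repeated sampling step can be sketched as follows. This is an illustration rather than the procedure actually coded for the paper: the function name, the data layout (a dictionary from weekday name to a list of diary identifiers of one type, all workdays or all nonwork days), and the use of Python's `random` module are assumptions. Sampling without replacement within each weekday guarantees that no diary is used twice for the same synthetic individual, while a fresh call for the next individual may reuse diaries.

```python
import random

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]


def build_synthetic_weeks(diaries_by_day, n_weeks, rng):
    """Assemble n_weeks of five-day weeks for one synthetic individual."""
    weeks = [[None] * len(WEEKDAYS) for _ in range(n_weeks)]
    for position, day in enumerate(WEEKDAYS):
        # Mondays come from Monday diaries, Tuesdays from Tuesday diaries, etc.
        picks = rng.sample(diaries_by_day[day], n_weeks)
        for week in range(n_weeks):
            weeks[week][position] = picks[week]
    return weeks


# Hypothetical pool: ten diary identifiers per weekday.
pool = {day: [f"{day}-{i}" for i in range(10)] for day in WEEKDAYS}
weeks = build_synthetic_weeks(pool, n_weeks=8, rng=random.Random(0))
print(weeks[0])
```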

To compute the contact probabilities, it was necessary to 
make a third assumption. Following Potthoff, Manton, and
Woodbury (1993), the contact probability was assumed to 
be equal to the number of minutes spent in activities done 
at home (excluding sleeping) divided by the time spent in 
all activities other than sleep. This process for generating 
contact probabilities has two important properties: (1) the 
contact probability for a given day is related to the activities 
done on that day, and (2) one group of potential
respondents (workers) has a lower average probability of a 
productive contact (0.36 vs. 0.72). 
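The contact probability assumption translates directly into a ratio computed from a diary. In the sketch below the episode layout (minutes, an at-home flag, and a sleep flag) and the sample day are hypothetical and are not drawn from the Time Diary Study.

```python
def contact_probability(episodes):
    """Share of non-sleep minutes that are spent at home.

    `episodes` is a list of (minutes, at_home, is_sleep) tuples for one diary day.
    """
    awake = sum(minutes for minutes, _, is_sleep in episodes if not is_sleep)
    at_home_awake = sum(
        minutes for minutes, at_home, is_sleep in episodes if at_home and not is_sleep
    )
    if awake == 0:
        raise ValueError("diary contains no non-sleep time")
    return at_home_awake / awake


# Hypothetical workday: 8 h sleep, 4 h at home, 9 h away, 3 h at home.
day = [(480, True, True), (240, True, False), (540, False, False), (180, True, False)]
print(contact_probability(day))
```

For this hypothetical day, 420 of the 960 waking minutes are spent at home.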

Tables 3a and 3b summarize the bias estimates from the 
augmented simulations. Table 3a shows the bias estimates 
assuming a 4-week field period, and Table 3b shows the 
same estimates assuming an 8-week field period. Each of 
the first four columns contains estimates of the bias 
associated with the four contact strategies. The entries for 
each strategy and each 1-digit activity include estimates of 
the activity bias for workers and nonworkers, and an 
estimate of the overall bias. The overall bias includes 
noncontact bias, so it is possible that the overall bias is 
larger (or smaller) than the activity bias for either group. 
The bias was computed as in the previous simulations,
and as before, an asterisk indicates that the bias is
significantly different from zero at the 5% level. The
fifth column shows the true time spent in each activity by 
group and overall. 

Comparing Tables 3a and 3b, we can see that the main 
difference is that, except for the DD strategy for which the 
field period is irrelevant, the overall bias is smaller when 
the field period is 8 weeks. This smaller overall bias is due 
mainly to the increased number of contact attempts, which 
disproportionately increases the probability that workers are 
contacted and makes the sample more representative (see 
Table 4). In contrast, estimates of the activity bias asso- 
ciated with the various contact strategies are not sensitive to 
the length of the contact period. The rest of this discussion 
will focus on the results in Table 3b. 

The DD strategy generated virtually no activity bias. 
There were a few activities — Active Leisure, Entertain- 
ment/Socializing, Organizational Activities, Education/ 
Training, and Active Child Care for workers, and Active 
Child Care for nonworkers — for which the activity bias was 
rather large, but none of these bias estimates are statistically 
significant. The overall bias for the DD strategy is quite 
large for most activities, which, as will be seen below, is 
primarily due to noncontact bias. 


Comparing the other three strategies, one can see two 
patterns emerge. First, activity bias is significantly smaller 
(and generally not statistically significant) when using the 
DDP strategy or the DDPS strategy than when using the 
CD strategy. Second, the bias in the CD estimates follows 
the expected pattern. The bias tends to be positive for 
activities that are done away from home (Active Leisure, 
Entertainment/Socializing, Organizational Activities, 
Education/Training, Purchasing Goods/Services, and Paid 
Work), and negative for activities done at home (Passive 
Leisure, Personal Care, Active Child Care, and House- 
work). This pattern is consistent with research cited in the 
introduction that finds that reported time spent away from 
home is greater under a convenient-day strategy than
under a designated-day strategy. More important, it is now 
clear that this finding is due to bias in convenient-day 
strategies rather than bias in designated-day strategies. 
Table 3a
Estimated Bias — Augmented Simulations (4 Week Field Period) 


Activity/                                                                         Time Spent in
Employment Status           CD           DD           DDP          DDPS         Activity (Truth)
Passive Leisure
  Nonworkers                -8.44*       0.12         -1.54        -1.03        314.72
  Workers                   -5.40*       1.07         0.43         0.82         152.04
  Overall                   -8.62*       13.56*       [illegible]  0.38         220.70
Active Leisure
  Nonworkers                9.80*        -2.75        0.99         -0.66        65.94
  Workers                   -0.07        -7.34        -4.69        [illegible]  26.89
  Overall                   4.03*        11.75*       [illegible]  1.08         43.37
Entertainment/Socializing
  Nonworkers                19.41*       -2.01        -0.25        -1.20        67.30
  Workers                   8.63*        7.14         5.21         [illegible]  27.87
  Overall                   13.11*       [illegible]  5.64*        [illegible]  44.51
Organizational Activities
  Nonworkers                19.58*       -0.98        9.00         3.84         19.25
  Workers                   13.77*       6.95         [illegible]  7.48         8.72
  Overall                   15.24*       15.26*       [illegible]  5.99         13.16
Education/Training
  Nonworkers                [illegible]  -0.42        12.54*       8.92*        43.60
  Workers                   -1.17        7.63         0.57         [illegible]  13.16
  Overall                   19.17*       [illegible]  [illegible]  8.00*        26.01
Personal Care
  Nonworkers                -0.50        -0.29        -0.49        -0.44        663.04
  Workers                   -0.52*       0.01         -0.06        -0.13        580.71
  Overall                   -0.79*       2.20*        0.34         -0.15        615.46
Purchasing Goods/Services
  Nonworkers                12.62*       1.35         0.11         -1.28        72.98
  Workers                   -4.05        4.62         -3.62        -5.43*       23.28
  Overall                   4.67*        22.30*       4.25*        -1.49        44.25
Active Child Care
  Nonworkers                -7.89*       [illegible]  -1.06        -0.54        24.13
  Workers                   -7.69*       -6.05        -4.09        -0.92        12.64
  Overall                   -9.09*       14.21*       0.77         -0.09        17.49
Housework
  Nonworkers                -8.88*       [illegible]  0.33         [illegible]  169.04
  Workers                   -10.55*      0.85         -2.03        -0.14        [illegible]
  Overall                   -11.49*      [illegible]  4.53*        [illegible]  104.82
Paid Work
  Nonworkers                —            —
  Workers                   2.95*        -0.77        0.25         -0.27        536.77
  Overall                   6.74*        31.44*       -7.74*       -1.87*       [illegible]


Note: Asterisks indicate that the bias in the estimated time spent in the activity is significantly different from zero at the 5% level. 


Noncontact Bias 


In general, the contact rate increases and the sample 
becomes more representative as the number of contact 
attempts increases (see Table 4). The contact rate is the 
lowest under the DD strategy (40%), and the sample is the 
least representative. Under both the DDP and the DDPS 
schedules, the contact rate increases and the sample 
becomes more representative as the field period increases 
from 4 to 8 weeks. Using a DDPS schedule with an 8-week 
field period (16 contact attempts) results in a contact rate of 
80% and a representative sample. Not surprisingly, the 
sample generated by the DDP schedule with an 8-week field
period is virtually identical to the one generated by the
DDPS schedule with a 4-week field period.


Activity Bias vs. Noncontact Bias 


To get a clearer picture of the contribution of each type 
of bias to the overall bias, the overall bias was decomposed 
into the portion due to activity bias, the portion due to 
noncontact bias, and the portion due to the interaction 
between the two biases. The overall bias for activity a and 
group g (workers or nonworkers) is given by: 


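$$
F_g X_{ag} - F_g^* X_{ag}^* =
\underbrace{F_g^*\,(X_{ag} - X_{ag}^*)}_{\text{Activity}}
+ \underbrace{X_{ag}^*\,(F_g - F_g^*)}_{\text{Noncontact}}
+ \underbrace{(F_g - F_g^*)(X_{ag} - X_{ag}^*)}_{\text{Interaction}}
$$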
Table 3b
Estimated Bias: Augmented Simulations (8 Week Field Period)
Activity/Employment Status CD DD DDP DDPS Time Spent in Activity (Truth)
Passive Leisure
Nonworkers -8.63* -0.09 -1.62 -1.21 315.38
Workers -5.24* 1.28 0.39 1.10 [illegible]
Overall -8.72* 13.51* -0.35 -0.31 220.79
Active Leisure
Nonworkers 10.62* -2.03 1.76 0.06 65.46
Workers 0.00 -7.29 -3.50 2.21 26.87
Overall 4.49* [illegible] 0.50 0.82 43.16
Entertainment/Socializing
Nonworkers 19.77* -1.72 -0.15 -0.91 67.10
Workers 8.09* 6.64 [illegible] 2.76 28.00
Overall 13.06* 15.80* 2.47 0.40 44.50
Organizational Activities
Nonworkers 18.92* -1.53 8.59 [illegible] 19.36
Workers 14.03* 7.00 3.18 7.25 [illegible]
Overall 14.89* 14.88* 7.14* 4.76 13.24
Education/Training
Nonworkers 33.56* 0.18 12.91* [illegible] 43.34
Workers -0.72 8.24 0.77 2.01 13.09
Overall 19.73* 22.74* 10.29* 7.32* 25.86
Personal Care
Nonworkers -0.50 -0.29 -0.48 -0.44 663.03
Workers -0.55* 0.00 -0.08 -0.16 580.81
Overall -0.82* 2.20* -0.17 -0.29 615.51
Purchasing Goods/Services
Nonworkers 12.64* 1.36 -0.09 -1.28 72.97
Workers -4.41 4.23 -3.66 [illegible] 23.36
Overall 4.48* [illegible] -0.42 -2.58 44.30
Active Child Care
Nonworkers -7.67* 5.36 -1.04 -0.31 24.07
Workers -8.02* -6.18 -4.98 -1.65 12.66
Overall -9.14* 14.30* -2.23 -0.89 17.48
Housework
Nonworkers -9.02* [illegible] 0.20 2.10 169.30
Workers -10.55* 0.80 -2.15 -0.20 57.95
Overall -11.64* 20.63* 0.17 1.34 104.94
Paid Work
Nonworkers n/a n/a n/a n/a n/a
Workers 2.96* -0.78 0.30 -0.26 536.82
Overall 6.86* -31.44* -0.86 -0.22 310.25
Note: Asterisks indicate that the bias in the estimated time spent in the activity is significantly different from zero at the 5% level.
Table 4
Contact Rate Summary: Augmented Simulations
Field Period / Measure CD DD DDP DDPS Truth
4 weeks
Contact Rate 89.68 40.35 71.79 78.39
Percent Nonworkers 40.08 60.07 46.82 43.14 42.21
Percent Workers 59.92 39.93 53.18 56.86 57.79
8 weeks
Contact Rate 89.79 40.35 78.87 80.17
Percent Nonworkers 40.02 60.07 42.88 42.19 42.21
Percent Workers 59.98 39.93 57.12 57.81 57.79


where $F_g$ is the fraction of the sample in group $g$, $X_{ag}$
is the time spent in activity $a$ by group $g$, and asterisks
indicate the true values. The total bias for activity $a$ is
obtained by summing this expression over workers and
nonworkers, and is given by:

$$
\sum_g \left(F_g X_{ag} - F_g^* X_{ag}^*\right) =
\sum_g F_g^*\,(X_{ag} - X_{ag}^*)
+ \sum_g X_{ag}^*\,(F_g - F_g^*)
+ \sum_g (F_g - F_g^*)(X_{ag} - X_{ag}^*).
$$

There are several things to take from these decompositions
(shown in Table 5). First, under the CD schedule, all of the
overall bias is due to activity bias. The large number of
contact attempts virtually guarantees a representative
sample, so that increasing the field period from 4 to 8 weeks
does not make much difference. In contrast, noncontact bias
accounts for all of the bias under the DD schedule. Under 
both the DDP schedule and the DDPS schedule there is 
virtually no activity bias, and noncontact bias decreases 
dramatically as the field period is increased from 4 to 8 
weeks. Not surprisingly, the noncontact bias for the DDP 
schedule with an 8-week field period is about the same as 


the noncontact bias under the DDPS schedule with a 
4-week field period. In these simulations, the sample 
becomes fully representative when the field period is long 
enough to allow 16 contact attempts. Finally, the small 
magnitude of the interaction terms reflects the fact that 
activity and noncontact biases associated with each contact 
strategy are negatively correlated. 
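
The decomposition can be checked numerically. The sketch below is only an illustration: the group shares and mean times are hypothetical values chosen for the example, not figures from the simulations, and the function name `decompose_bias` is an assumption of this sketch. It computes the three components for each group, sums them over workers and nonworkers, and compares the result with the total bias computed directly.

```python
def decompose_bias(F, X, F_true, X_true):
    """Split F*X - F_true*X_true into activity, noncontact and interaction parts."""
    activity = F_true * (X - X_true)
    noncontact = X_true * (F - F_true)
    interaction = (F - F_true) * (X - X_true)
    return activity, noncontact, interaction


# Hypothetical inputs (not taken from the paper): sample share F and estimated
# mean time X for each group, with F_true and X_true the population values.
groups = {
    "nonworkers": {"F": 0.47, "X": 310.0, "F_true": 0.42, "X_true": 315.0},
    "workers": {"F": 0.53, "X": 153.0, "F_true": 0.58, "X_true": 152.0},
}

# Summing each component over the groups decomposes the total bias.
parts = [decompose_bias(**g) for g in groups.values()]
activity_total = sum(p[0] for p in parts)
noncontact_total = sum(p[1] for p in parts)
interaction_total = sum(p[2] for p in parts)

# Direct calculation of the total bias, for comparison.
total_bias = sum(g["F"] * g["X"] - g["F_true"] * g["X_true"] for g in groups.values())

print(activity_total, noncontact_total, interaction_total, total_bias)
```

In this made-up case the sample over-represents nonworkers, so the noncontact component dominates, which is the pattern the text describes for the DD schedule.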


Table 5
Bias Decomposition: Augmented Simulations
Columns 2 to 5: 4-week field period. Columns 6 to 9: 8-week field period.
Schedule Total Bias Activity Bias Noncontact Bias Interaction Total Bias Activity Bias Noncontact Bias Interaction
Passive Leisure
CD -8.62 -7.23 -1.57 0.18 -8.72 -7.29 -1.62 0.19
DD 13.56 0.50 13.16 -0.10 13.51 0.46 13.24 -0.18
DDP 2.53 -0.75 3.40 -0.11 -0.35 -0.83 0.50 -0.02
DDPS 0.38 -0.29 0.69 -0.02 -0.31 -0.30 -0.01 0.00
Active Leisure
CD 4.03 6.27 -1.92 -0.32 4.49 6.80 -1.96 -0.35
DD 11.75 -4.40 16.08 0.06 12.30 -3.92 15.97 0.26
DDP [illegible] -1.05 4.15 0.20 0.50 -0.13 0.60 0.03
DDPS 1.08 0.26 0.84 -0.02 0.82 0.83 -0.02 0.00
Entertainment/Socializing
CD 13.11 15.51 -1.89 -0.51 13.06 15.53 -1.92 -0.54
DD 15.78 1.30 15.82 -1.34 15.80 1.32 15.69 -1.21
DDP 5.64 1.72 4.08 -0.17 2.47 1.91 0.59 -0.02
DDPS 1.37 0.58 0.82 -0.04 0.40 0.42 -0.02 0.00
Organizational Activities
CD 15.24 17.36 -1.70 -0.42 14.89 17.06 -1.76 -0.40
DD 15.26 2.05 14.28 -1.08 14.88 [illegible] 14.39 -1.23
DDP 12.37 8.30 3.69 0.39 7.14 6.53 0.54 0.07
DDPS 5.99 5.24 0.74 0.01 4.76 4.77 -0.02 0.00
Education & Training
CD 19.17 22.84 -2.49 -1.18 19.73 23.53 -2.56 -1.24
DD 22.02 1.94 20.90 -0.82 22.74 2.54 20.90 -0.69
DDP 15.39 9.04 5.40 0.96 10.29 9.36 0.78 0.14
DDPS 8.00 6.78 1.09 0.13 7.32 [illegible] -0.02 0.00
Personal Care
CD -0.79 -0.51 -0.28 0.00 -0.82 -0.53 -0.29 0.00
DD 2.20 -0.13 2.39 -0.06 2.20 -0.13 2.39 -0.06
DDP 0.34 -0.26 0.62 -0.02 -0.17 -0.26 0.09 0.00
DDPS -0.15 -0.27 0.12 0.00 -0.29 -0.29 0.00 0.00
Purchasing Goods/Services
CD 4.67 [illegible] -2.39 -0.49 4.48 7.44 -2.45 -0.51
DD 22.36 2.34 20.06 -0.04 [illegible] 2.23 20.00 0.00
DDP 4.25 -1.02 5.18 0.10 -0.42 -1.18 0.75 0.01
DDPS -1.49 -2.54 1.04 0.01 -2.58 -2.55 -0.02 0.00
Active Child Care
CD -9.09 -7.81 -1.40 0.11 -9.14 -7.82 -1.43 0.10
DD 14.21 0.45 11.72 2.04 14.30 0.53 11.66 2.12
DDP 0.77 -2.32 3.03 0.07 -2.23 -2.69 0.44 0.01
DDPS -0.09 -0.69 0.61 0.00 -0.89 -0.87 -0.01 0.00
Housework
CD -11.49 -9.42 -2.26 0.18 -11.64 -9.51 -2.32 0.19
DD 20.77 1.43 18.93 0.41 20.63 [illegible] 18.95 0.37
DDP 4.53 -0.43 4.89 0.08 0.17 -0.55 0.71 0.01
DDPS [illegible] 1.50 0.99 0.03 1.34 1.36 -0.02 0.00
Paid Work
CD 6.74 2.95 3.69 0.11 6.86 2.96 [illegible] 0.11
DD -31.43 -0.77 -30.90 0.24 -31.44 -0.78 -30.90 0.24
DDP -7.74 0.25 -7.98 -0.02 -0.86 0.30 -1.16 0.00
DDPS -1.87 -0.27 -1.61 0.00 -0.22 -0.26 0.03 0.00
4. SUMMARY AND RECOMMENDATIONS


Telephone time-use surveys have unique characteristics 
that make data collection more challenging. Unlike most 
other surveys, time-use surveys cannot accept proxy 
responses, so it is more likely that the probability of 
contacting a potential respondent is correlated with his or 
her activities. And because telephone time-use surveys ask 
respondents to report on their activities during the previous 
day, it is possible that the probability of interviewing the 
respondent about a given reference day will be correlated 
with the activities on that reference day. This paper shows 
how these characteristics can generate noncontact bias and 
activity bias. Two sets of computer simulations showed that 
the extent of these biases depends on the survey’s strategy 
for contacting potential respondents. 

In the first set of simulations, it was shown that the 
extent of the bias associated with any given contact 
schedule depends on the pattern of easy-to-contact (ETC) 
and hard-to-contact (HTC) days. The designated-day-with-postponement
(DDP) schedule outperformed the other
contact schedules for all of the activity patterns examined. 
These simulations also showed that estimates generated 
using a convenient-day (CD) schedule are sensitive to the 
within-person variance of the contact probability. Estimates 
of the time spent in activities that are positively correlated 
with the contact probability (for example, activities done at 
home) decrease as the variance increases. In contrast, estimates
generated by other contact schedules are not sensitive
to the within-person variance of the contact probability. 

Given the results of the simple simulations, it is clear that 
the overall bias for the different contact strategies depends 
on the relative frequency of each pattern in the population. 
Direct data on these patterns do not exist, so the first set of 
simulations was augmented using CPS data on work 
schedules and actual time-use data from the 1992-94 EPA 
Time Diary Study. The results from the augmented 
simulations confirm those from the simple simulations, and 
show how the bias can affect estimates of time spent in 
specific activities. As expected, the CD contact strategy 
introduces systematic activity bias into time-use estimates. 
The time spent in activities done at home is underestimated, 
while time spent in activities done away from home is 
overestimated. There is no systematic activity bias in the 
samples generated by the DDP and DDPS strategies. The 
simulations also show that increasing the number of contact 
attempts reduces noncontact bias. 

These results clearly show that the choice of contact 
strategy matters and point to two recommendations. 

First, time-use surveys should use the DDP schedule. 
The DDP schedule generates less activity bias than the 
other contact schedules under all of the activity patterns 
tested. The DDPS schedule performed nearly as well in the 
more common activity patterns. But given that contact rates 
and field costs are a function of the number of contact 
attempts, the DDPS offers no cost advantage over the DDP 
schedule. Hence, there is no reason to choose the DDPS
schedule over the DDP schedule. 

Second, time-use surveys need to take steps to minimize 
noncontact bias. Because noncontact bias is largely a 
function of the number of contact attempts, an obvious way 
to minimize noncontact bias would be to increase the 
number of contact attempts. No further elaboration will be 
made on this point, because other authors have looked at 
this issue in depth. For example, Bauman, Lavrakas and
Merkle (1993) show that age and employment status are 
related to the number of callbacks and that additional 
callbacks generate a more representative sample, and 
Botman, Massey and Kalsbeek (1989) propose a method for 
determining the optimal number of callbacks. Another 
alternative would be to try to increase the probability of 
contacting potential respondents. This could be done by 
determining when they are likely to be home and calling at 
those times, or by allowing them to call on their designated 
interview day. Paying incentives is another way to make 
potential respondents become “more available.” A less 
costly approach to minimizing noncontact bias would be to 
adjust sample weights. Potthoff et al. (1993) show that,
when the variable being measured is correlated (across 
individuals) with the contact probability, weighting based 
on the number of callbacks is practical and effective. In the 
end, the correct mix of these approaches will depend on the 
constraints facing the survey manager. 
Bias Reduction in Standard Errors for Linear Regression with
Multi-Stage Samples


ROBERT M. BELL and DANIEL F. MCCAFFREY


ABSTRACT 


Linearization (or Taylor series) methods are widely used to estimate standard errors for the coefficients of linear regression 
models fit to multi-stage samples. When the number of primary sampling units (PSUs) is large, linearization can produce 
accurate standard errors under quite general conditions. However, when the number of PSUs is small or a coefficient 
depends primarily on data from a small number of PSUs, linearization estimators can have large negative bias. In this paper, 
we characterize features of the design matrix that produce large bias in linearization standard errors for linear regression 
coefficients. We then propose a new method, bias reduced linearization (BRL), based on residuals adjusted to better 
approximate the covariance of the true errors. When the errors are i.i.d., the BRL estimator is unbiased for the variance. 
Furthermore, a simulation study shows that BRL can greatly reduce the bias even if the errors are not i.i.d. We also propose 
using a Satterthwaite approximation to determine the degrees of freedom of the reference distribution for tests and 
confidence intervals about linear combinations of coefficients based on the BRL estimator. We demonstrate that the 
jackknife estimator also tends to be biased in situations where linearization is biased. However, the jackknife’s bias tends 
to be positive. Our bias reduced linearization estimator can be viewed as a compromise between the traditional linearization
and jackknife estimators.


KEY WORDS: Complex samples; Linearization; Jackknife; Satterthwaite approximation; Degrees of Freedom. 


1. INTRODUCTION 


Regression analysis of multi-stage samples has become 
very common in recent years (for example, Ellickson and 
McGuigan 2000; Shapiro, Morton, McCaffrey, Senterfitt,
Fleishman, Perlman, Athey, Keesey, Goldman, Berry and
Bozzette 1999; Goldstein 1991; Landis, Lepkowski, Eklund
and Stehouwer 1982). Although hierarchical models (Bryk
and Raudenbush 1992; Gelman, Carlin, Stern and Rubin 
1995, Chapter 13) allow analysis of both fixed and random 
effects, many analysts prefer the simplicity of standard 
regression models when random effects are not of direct 
interest. Standard regression estimators produce unbiased 
parameter estimates that can be efficient, but the default 
standard error estimators do not account for the sample 
design, resulting in inconsistent standard errors (Kish 1965; 
Skinner 1989a). Various methods produce consistent 
standard error estimates applicable when the number of 
primary sampling units (PSUs) is sufficiently large. These 
include sample reuse methods such as the jackknife, bootstrap
and balanced repeated replication as well as linearization
(or Taylor series) methods.

Linearization (Skinner 1989b) is a nonparametric
method for estimating the standard errors of design-based
statistics such as means and ratios as well as coefficients
from linear and nonlinear regression models. By nonparametric,
we mean that linearization does not rest on any
assumptions about the within-PSU error structure, such as
an assumption of constant intra-cluster correlation. When
the number of PSUs can be considered large, linearization
produces consistent standard errors in the presence of
multiple features of complex sample designs (stratification,
multi-stage sampling, and sampling weights) as well as
heteroskedastic errors (Fuller 1975). Because of these
desirable properties and its increased availability in 
software such as SUDAAN, Stata, and SAS Version 8.0 
(Shah, Barnwell, and Bieler 1997; StataCorp. 1999; SAS 
Institute, Inc. 1999), linearization has become a common 
method for estimating standard errors and confidence 
intervals and for conducting statistical tests on data from 
complex sample designs (for example, Ellickson and 
McGuigan 2000; Shapiro et al. 1999; Rust and Rao 1996). 
Linearization has also been proposed for estimating 
standard errors from Generalized Estimating Equations 
(GEE) fit to multi-stage data (Zeger and Liang 1986). 

However, the linearization method has limitations. 
When the number of primary sampling units is small, 
standard error estimates can be severely biased low, they 
can have large coefficients of variation, and the standard 
degrees of freedom may be far too liberal (Kott 1994; 
Murray, Hannan, Wolfinger, Baker and Dwyer 1998). 
Consequently, standard linearization inference for coeffi- 
cients based mainly on data from a small number of PSUs 
may produce confidence intervals that are too narrow and 
tests with Type I error rates that are substantially higher 
than their nominal values. Sample reuse methods like the 
jackknife have similar limitations. 

In this paper, we characterize the design factors (i.e., the 
distribution of explanatory variables within and between 
PSUs) that produce large bias in linearization and jackknife 
standard errors for linear regression coefficients and
demonstrate that the problem can persist even when the 
number of PSUs is quite large. We then propose an 
alternative to the standard linearization estimator that is 
unbiased for independent, identically distributed (i.i.d.) 
errors and tends to greatly reduce bias otherwise. We also 
present approximate degrees of freedom for use with tests 
and confidence intervals based on our variance estimator. 
Simulation results show improved small sample properties 
of our alternative estimator and test compared with those of 
more traditional methods. Finally, we present an example of 
our methods using data from a national experiment 
evaluating care for depression. 


2. BIAS OF THE LINEARIZATION METHOD 


For simplicity, we restrict consideration in the body of 
this paper to unweighted linear regression for two-stage 
nonstratified samples. Extensions to weighted estimators 
and stratified samples are presented in McCaffrey, Bell and 
Botts (2001) and discussed further in section 8. 

Let $n$ equal the number of PSUs and $m_i$ equal the
number of final sampling units from the $i$-th PSU, for
$i = 1, \ldots, n$. The overall sample size is $M = \sum_i m_i$. We
assume that $y_{ij} = x_{ij}'\beta + \varepsilon_{ij}$, where $\varepsilon$ has mean 0 and
covariance matrix $V$, and where $y_{ij}$, $x_{ij}$, and $\varepsilon_{ij}$ all refer to
the $j$-th observation from the $i$-th PSU. We drop the
standard OLS assumption of i.i.d. errors, assuming only that
errors from distinct PSUs are uncorrelated. Specifically, we
assume that $V$ is block diagonal, with $m_i \times m_i$ blocks $V_i$ for
$i = 1, \ldots, n$. In addition to the notation of this model,
throughout the paper, we let $I$ denote an $M \times M$ identity
matrix and $I_i$ equal an $m_i \times m_i$ identity matrix.

Let $\hat\beta$ denote the estimated coefficients of the linear
regression model. To simplify presentation, we generally
discuss a linear combination of the regression coefficients,
$l'\hat\beta$, for an arbitrary column vector $l$. For the special case
where one element of $l$ equals 1 and the rest are 0, $l'\hat\beta$ equals a
single estimated coefficient. If errors are uncorrelated
across PSUs, the variance of $l'\hat\beta$ is

$$
\mathrm{Var}(l'\hat\beta) = l'(X'X)^{-1}\left(\sum_{i=1}^{n} X_i' V_i X_i\right)(X'X)^{-1}\, l \qquad (1)
$$

where $X$ and $X_i$ are the design matrices for the entire
sample and for PSU $i$, respectively.

The standard linearization estimator of the variance of
$l'\hat\beta$ is given by:

$$
v_L = c\, l'(X'X)^{-1}\left(\sum_{i=1}^{n} X_i' r_i r_i' X_i\right)(X'X)^{-1}\, l \qquad (2)
$$

where $r_i$ is the vector of residuals for the $i$-th PSU.
Comparison of (1) and (2) shows that linearization simply
involves estimating $V_i$ by a constant $c$ times the outer
product of the residuals. The constant $c$ is typically set
equal to $n/(n-1)$, the value used by SUDAAN and the
Stata svy procedures (Shah, Barnwell, and Bieler 1997;
StataCorp. 1999). For GEE procedures, Zeger and Liang
(1986) set $c = 1$.
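
As an illustration of equation (2), the following sketch computes the linearization estimator for an unweighted OLS fit with NumPy. It is a minimal example under stated assumptions rather than the SUDAAN or Stata implementation: the function name `linearization_variance` and the simulated two-stage data are hypothetical and are not taken from the paper.

```python
import numpy as np


def linearization_variance(X, y, psu, l, c=None):
    """Linearization estimate of Var(l' beta_hat), equation (2), for unweighted OLS."""
    X, y, psu, l = (np.asarray(a) for a in (X, y, psu, l))
    labels = np.unique(psu)
    n = len(labels)
    if c is None:
        c = n / (n - 1)                      # the usual choice of c described above
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    middle = np.zeros((X.shape[1], X.shape[1]))
    for g in labels:
        rows = psu == g
        score = X[rows].T @ resid[rows]      # X_i' r_i
        middle += np.outer(score, score)     # X_i' r_i r_i' X_i
    return c * (l @ XtX_inv @ middle @ XtX_inv @ l)


# Simulated two-stage sample (hypothetical): 10 PSUs of 5 units, a covariate that
# varies mostly between PSUs, and a shared PSU-level error component.
rng = np.random.default_rng(0)
n_psu, m = 10, 5
psu = np.repeat(np.arange(n_psu), m)
x = np.repeat(rng.normal(size=n_psu), m) + 0.3 * rng.normal(size=n_psu * m)
X = np.column_stack([np.ones(n_psu * m), x])
y = 1.0 + 2.0 * x + np.repeat(rng.normal(size=n_psu), m) + rng.normal(size=n_psu * m)

l = np.array([0.0, 1.0])                     # picks out the slope
v_L = linearization_variance(X, y, psu, l)               # c = n/(n-1)
v_L_c1 = linearization_variance(X, y, psu, l, c=1.0)     # c = 1, the GEE choice
print(v_L, v_L_c1)
```

The two calls differ only by the factor $n/(n-1)$, which shows how little the constant $c$ can do about bias when $n$ is moderately large.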

Under fairly general conditions, $n v_L$ converges in probability
to the variance of the asymptotic distribution of
$\sqrt{n}\,(l'\hat\beta - l'\beta)$ and the relative bias of $v_L$ is $O(1/n)$ as the
number of PSUs gets large (Fuller 1975; Kott 1994). To
demonstrate convergence for the bias of $v_L$, Kott (1994)
assumes that the number of observations from every PSU is
bounded and that elements of $(X'X)^{-1}X'$ are bounded by
$B/n$ for a constant $B$. These assumptions effectively ensure
that the influence of any PSU on the final estimate diminishes
as the number of PSUs grows. Convergence of the
bias of $v_L$ holds for heteroskedastic data from stratified
samples with unequal sampling weights and arbitrary correlation
structure within PSUs. Unfortunately, consistency
does not guarantee good properties for small to moderate
numbers of PSUs.


Theorem 1. When $V = \sigma^2 I$ and $c = n/(n-1)$, $E(v_L) \le
\mathrm{Var}(l'\hat\beta)$ with equality if and only if $l'(X'X)^{-1}X_i'X_i$ is
constant across $i$.


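Proof. Without loss of generality, we assume that $\sigma^2 = 1$ so
that $V = I$. The residual vector $r$ can be written as
$(I - H)\varepsilon$, where $H = X(X'X)^{-1}X'$ is the hat or projection
matrix for $X$. Thus, we have that $r_i = (I - H)_i\,\varepsilon$, where
$(I - H)_i$ contains the $m_i$ rows of $(I - H)$ for the $i$-th PSU.
Consequently,

$$
\begin{aligned}
E(v_L) &= \left(\frac{n}{n-1}\right) l'(X'X)^{-1}\left[\sum_{i=1}^{n} X_i'(I-H)_i\,E(\varepsilon\varepsilon')\,(I-H)_i'X_i\right](X'X)^{-1}l \\
&= \left(\frac{n}{n-1}\right) l'(X'X)^{-1}\left[\sum_{i=1}^{n}\left(X_i'X_i - X_i'X_i(X'X)^{-1}X_i'X_i\right)\right](X'X)^{-1}l
\end{aligned}
$$

because $E(\varepsilon\varepsilon') = I$ and $(I - H)_i(I - H)_i' = (I_i - H_{ii})$ for
$H_{ii} = X_i(X'X)^{-1}X_i'$. Let $D_i = X_i'X_i - (1/n)(X'X)$. Note
that $\sum_i D_i = \sum_i X_i'X_i - X'X = 0$. Thus,

$$
\begin{aligned}
E(v_L) &= \left(\frac{n}{n-1}\right) l'(X'X)^{-1}\left[\sum_{i=1}^{n}\left(X_i'X_i - \left[(1/n)X'X + D_i\right](X'X)^{-1}\left[(1/n)X'X + D_i\right]\right)\right](X'X)^{-1}l \\
&= \left(\frac{n}{n-1}\right) l'(X'X)^{-1}\left[X'X - (1/n)X'X - \sum_{i=1}^{n} D_i(X'X)^{-1}D_i\right](X'X)^{-1}l \\
&= l'(X'X)^{-1}l - \left(\frac{n}{n-1}\right) l'(X'X)^{-1}\left[\sum_{i=1}^{n} D_i(X'X)^{-1}D_i\right](X'X)^{-1}l \\
&= \mathrm{Var}(l'\hat\beta) - \left(\frac{n}{n-1}\right)\sum_{i=1}^{n} a_i'(X'X)^{-1}a_i
\end{aligned}
\qquad (4)
$$

for $a_i = D_i(X'X)^{-1}l = [X_i'X_i - (1/n)(X'X)](X'X)^{-1}l$.

Because $(X'X)^{-1}$ is positive definite, $E(v_L) \le \mathrm{Var}(l'\hat\beta)$
with equality if and only if $a_i = 0$, or equivalently,
$X_i'X_i(X'X)^{-1}l$ is constant across the $i$.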
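
Equation (4) can be verified numerically without simulating any errors, because under $V = I$ the expectation $E(r_i r_i') = I_i - H_{ii}$ is known exactly. The sketch below does this for a hypothetical design with unequal PSU sizes; the design, the seed, and the variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [3, 4, 5, 6, 8, 10]              # hypothetical unequal PSU sizes
n = len(sizes)
psu = np.repeat(np.arange(n), sizes)
# Covariate with both between-PSU and within-PSU variation.
x = rng.normal(size=psu.size) + np.repeat(rng.normal(size=n), sizes)
X = np.column_stack([np.ones(psu.size), x])
l = np.array([0.0, 1.0])                 # slope

XtX = X.T @ X
XtX_inv = np.linalg.inv(XtX)
c = n / (n - 1)
true_var = l @ XtX_inv @ l               # Var(l' beta_hat) when V = I

expected_middle = np.zeros((2, 2))
penalty = 0.0
for g in range(n):
    Xi = X[psu == g]
    Hii = Xi @ XtX_inv @ Xi.T
    # E(r_i r_i') = I_i - H_ii when V = I
    expected_middle += Xi.T @ (np.eye(len(Xi)) - Hii) @ Xi
    a_i = (Xi.T @ Xi - XtX / n) @ XtX_inv @ l
    penalty += a_i @ XtX_inv @ a_i       # a_i' (X'X)^{-1} a_i

expected_vL = c * (l @ XtX_inv @ expected_middle @ XtX_inv @ l)
print(true_var, expected_vL, true_var - c * penalty)
```

The second and third printed values agree, and both fall below the first, as Theorem 1 states.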

Replication methods do not necessarily avoid the
problem of bias for regression variance estimators. A
jackknife estimator for multi-stage samples can be derived
from the set of pseudo values $\{\hat\beta_{(i)}\}$, estimates of $\beta$ from
data that exclude the $i$-th PSU:

$$
v_{JK} = \frac{n-1}{n}\sum_{i=1}^{n} l'(\hat\beta_{(i)} - \hat\beta)(\hat\beta_{(i)} - \hat\beta)'\,l \qquad (5)
$$

(Cochran 1977; Rust and Rao 1996). If $(I_i - H_{ii})^{-1}$ exists
for all $i$, then

$$
v_{JK} = \frac{n-1}{n}\, l'(X'X)^{-1}\left(\sum_{i=1}^{n} X_i'(I_i - H_{ii})^{-1} r_i r_i' (I_i - H_{ii})^{-1} X_i\right)(X'X)^{-1}\, l, \qquad (6)
$$

which follows from the updating formula
$(X'X - X_i'X_i)^{-1} = (X'X)^{-1} + (X'X)^{-1}X_i'(I_i - H_{ii})^{-1}X_i(X'X)^{-1}$
(Cook and Weisberg 1982; Bell and
McCaffrey 2002, page 34). Some authors (Efron and
Tibshirani 1993) suggest an alternative jackknife estimator
with $\hat\beta$ replaced by the mean of the $\hat\beta_{(i)}$'s in (5). These two
methods provide very similar estimates in our simulations,
so we discuss only the version based on (5) in what follows.
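
The equivalence of (5) and (6) can be illustrated with a short sketch that computes the jackknife both ways: by refitting the regression with each PSU deleted, and by the closed form based on the updating formula. The function names and the simulated data are hypothetical and serve only to illustrate the identity.

```python
import numpy as np


def jackknife_refit(X, y, psu, l):
    """Equation (5): refit OLS with each PSU deleted in turn."""
    labels = np.unique(psu)
    n = len(labels)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    total = 0.0
    for g in labels:
        keep = psu != g
        beta_del = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        total += (l @ (beta_del - beta_hat)) ** 2
    return (n - 1) / n * total


def jackknife_closed_form(X, y, psu, l):
    """Equation (6): residuals adjusted by inv(I_i - H_ii), no refitting."""
    labels = np.unique(psu)
    n = len(labels)
    XtX_inv = np.linalg.inv(X.T @ X)
    resid = y - X @ (XtX_inv @ X.T @ y)
    w = XtX_inv @ l
    total = 0.0
    for g in labels:
        rows = psu == g
        Xi = X[rows]
        Hii = Xi @ XtX_inv @ Xi.T
        adjusted = np.linalg.solve(np.eye(len(Xi)) - Hii, resid[rows])
        total += (w @ Xi.T @ adjusted) ** 2
    return (n - 1) / n * total


# Hypothetical clustered data: 8 PSUs of 6 units each.
rng = np.random.default_rng(2)
n_psu, m = 8, 6
psu = np.repeat(np.arange(n_psu), m)
x = rng.normal(size=n_psu * m) + np.repeat(rng.normal(size=n_psu), m)
X = np.column_stack([np.ones(n_psu * m), x])
y = 0.5 + 1.5 * x + np.repeat(rng.normal(size=n_psu), m) + rng.normal(size=n_psu * m)
l = np.array([0.0, 1.0])

v_refit = jackknife_refit(X, y, psu, l)
v_closed = jackknife_closed_form(X, y, psu, l)
print(v_refit, v_closed)
```

The closed form avoids $n$ separate regressions, which matters little here but can matter when the number of PSUs or coefficients is large.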


Theorem 2. When $V = \sigma^2 I$ and $(I_i - H_{ii})^{-1}$ exists for all
$i$, then $E(v_{JK}) \ge \mathrm{Var}(l'\hat\beta)$ with equality if and only if
$l'(X'X)^{-1}X_i'X_i$ is constant across $i$ (proof in appendix).

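The following example shows that the conditions for
linearization and the jackknife estimators to be unbiased are
very restrictive even for simple linear regression.

Example 1. Consider simple linear regression. We have
that

$$
X_i'X_i = m_i\begin{pmatrix} 1 & \bar x_i \\ \bar x_i & s_i^2 + \bar x_i^2 \end{pmatrix},
\qquad
(X'X)^{-1} = \frac{1}{M s^2}\begin{pmatrix} s^2 + \bar x^2 & -\bar x \\ -\bar x & 1 \end{pmatrix},
$$

where $s^2$ and $\{s_i^2\}$ are ML estimates for the overall and
within-PSU variances of $x$, with divisors $M$ and $\{m_i\}$,
respectively, and $\bar x$ and $\{\bar x_i\}$ are the corresponding means. So we have

$$
X_i'X_i(X'X)^{-1} = \frac{m_i}{M s^2}
\begin{pmatrix}
s^2 + \bar x^2 - \bar x_i \bar x & \bar x_i - \bar x \\
(s^2 + \bar x^2)\,\bar x_i - (s_i^2 + \bar x_i^2)\,\bar x & s_i^2 + \bar x_i^2 - \bar x_i \bar x
\end{pmatrix}.
$$

To have $v_L$ and $v_{JK}$ unbiased for the slope, i.e., for
$l' = (0, 1)$, we must have that $m_i(\bar x_i - \bar x)$ and
$m_i(s_i^2 + \bar x_i^2 - \bar x_i \bar x)$ are both constant across $i$. The former
implies that $\bar x_i = \bar x$, and together they imply that
$m_i s_i^2 = \sum_j (x_{ij} - \bar x_i)^2$ is constant. Note that $m_i$ need not be
constant. These two conditions are not sufficient to
guarantee unbiasedness for $l' = (1, 0)$, however. Additional
algebra shows that the bias in the linearization estimator for
the variance of the slope equals

$$
-\frac{n}{n-1}\,\frac{\sigma^2}{(M s^2)^3}
\left\{ s^2 \sum_{i=1}^{n} m_i^2 (\bar x_i - \bar x)^2
+ \sum_{i=1}^{n}\left[\sum_{j=1}^{m_i} (x_{ij} - \bar x)^2 - \bar m s^2\right]^2 \right\},
$$

where $\bar m = M/n$. Consequently, the bias includes a part that is proportional
to the weighted variance of the PSU means of $x$ and another
that is proportional to the variance of the within-PSU sums
of squares.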

The example shows that when the errors are i.i.d., $v_L$ is
unbiased only under very restrictive conditions. When
$V \ne \sigma^2 I$, Theorems 1 and 2 do not hold, and the bias in $v_L$
can even be positive (see Example 2 of Bell and McCaffrey
2002).

In general, $v_L$ tends to have negative bias. The estimator
is the sum over PSUs of squares of linear combinations of
residuals, $c^{1/2}\, l'(X'X)^{-1}X_i' r_i$. These sums of squares tend
to be too small for two reasons: residuals are generally
smaller than true errors due to overfitting, and residuals
tend to have lower intra-cluster correlation than the errors.
The factor $c = n/(n-1)$ corrects completely for these
problems only in very restricted circumstances like the
conditions in Theorem 1.

The bias of the linearization estimator (or the jackknife)
increases with the between-PSU variance of the explanatory
variables. Consequently, explanatory variables that are
(nearly) constant within PSUs tend to exhibit the largest
bias. When there are several such explanatory variables,
there can be substantial underestimation of intra-cluster
correlations, leading to large bias in estimated variances for
all the corresponding coefficients. Even greater bias
potential appears to occur when certain PSUs account for
most of the variability in the covariates and have disproportionate
impact on the determination of $l'\hat\beta$.


3. THE BIAS REDUCED LINEARIZATION 
METHOD 


Phillip Kott has proposed two methods for reducing the
bias in linearization. Kott (1994) suggested correcting the
bias in $v_L$ by using the residuals and the design matrix to
estimate the negative of the bias of $v_L$ by $\hat R$ ($\hat R > 0$, typically)
and setting $v_{K94} = v_L/(1 - \hat R/v_L)$. Kott suggested the
estimator $v_{K94}$ rather than the more obvious $(v_L + \hat R)$ as
ad hoc compensation for the relative bias in $\hat R$ as an
estimator of the true negative bias, $R$.

In his 1996 paper, Kott suggests calculating the ratio of
$\mathrm{Var}(l'\hat\beta)$ to $E(v_L)$ under the assumption that $V = I$ and
adjusting $v_L$ by the ratio. If $V = I$ then the resulting estimator
$v_{K96}$ will be unbiased.

In the context of generalized estimating equations,
Mancl and DeRouen (2001) take a different approach to
correcting the bias in the linearization estimator. They
suggest adjusting the residuals from each PSU to reduce the
bias in $r_i r_i'$ as an estimator of $V_i$. For the unweighted
linear model given in section 2, they approximate $E(r_i r_i')$
by $(I_i - H_{ii})V_i(I_i - H_{ii})$ and suggest replacing $r_i$ in
equation (2) with $(I_i - H_{ii})^{-1} r_i$. Thus, for unweighted linear
models the Mancl and DeRouen estimator equals
$[n/(n-1)]\,v_{JK}$, and the properties of this estimator follow from
the properties of the jackknife estimator.

We present an alternative approach that we first
proposed in 1997 (McCaffrey and Bell 1997). The method
is also based on replacing $r_i$ in equation (2) with adjusted
residuals of the form $r_i^* = A_i r_i$ intended to act more like
the true errors $\varepsilon_i$. Like Kott (1996), we derive an estimator
that eliminates the bias of $v_L$ when $V$ equals $U$, a specified
block-diagonal covariance matrix, and reduces the bias for
other $V$. Like Mancl and DeRouen (2001) we adjust the
residuals from each PSU. However, using $U$ we derive an
alternative approximation to $E(r_i r_i')$, and our resulting
estimator is not proportional to the jackknife but rather can
be seen as a compromise between the linearization and
jackknife estimators. Our approach is also a generalization
of the method of MacKinnon and White (1985), who adjust
individual residuals to produce a heteroskedasticity-consistent
variance estimator (in the sense of White 1980)
that is unbiased when the errors are independent and
homoskedastic.


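Theorem 3. For a specified block-diagonal covariance
matrix $U$ with blocks $U_i$, consider the class of estimators
$v^* = l'(X'X)^{-1}\left(\sum_{i=1}^{n} X_i' A_i r_i r_i' A_i' X_i\right)(X'X)^{-1} l$, where $A_i$ satisfies

$$
A_i\{(I - H)_i\, U\, (I - H)_i'\}A_i' = U_i, \quad \text{for } i = 1, \ldots, n.
$$

If $V = kU$ for some scalar $k$, then $E(v^*) = \mathrm{Var}(l'\hat\beta)$.

Proof. The expected value of $v^*$ is given by

$$
\begin{aligned}
E(v^*) &= l'(X'X)^{-1}\left[\sum_{i=1}^{n} X_i' A_i (I - H)_i (kU)(I - H)_i' A_i' X_i\right](X'X)^{-1} l \\
&= l'(X'X)^{-1}\left[\sum_{i=1}^{n} X_i' (kU_i) X_i\right](X'X)^{-1} l = \mathrm{Var}(l'\hat\beta).
\end{aligned}
$$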


Without external evidence to the contrary, an analyst is
likely to use a working covariance matrix of the form
$U = \sigma^2 I$, which simplifies the condition on $A_i$ to
$A_i(I_i - H_{ii})A_i' = I_i$, or

$$
A_i' A_i = (I_i - H_{ii})^{-1}. \qquad (7)
$$

We set $U = I$ in what follows.

A solution to equation (7) exists for PSU $i$ whenever
$(I_i - H_{ii})$ is full rank, which is true if all the eigenvalues of $H_{ii}$
are strictly less than 1 (the eigenvalues of $H_{ii}$ are always
between 0 and 1). An eigenvalue of $H_{ii}$ may equal 1, for example
when the model includes a dichotomous explanatory
variable that is one if and only if an observation falls in the
$i$-th PSU.

For $m_i > 1$, $A_i$ is not unique. If $A_i$ satisfies
$A_i' A_i = (I_i - H_{ii})^{-1}$ then so does $O A_i$, for any $m_i \times m_i$
orthogonal matrix $O$. If $V = \sigma^2 I$, the choice of $A_i$ is
unimportant because any solution to (7) will produce an
unbiased variance estimator. However, the resulting estimators
are biased when $V \ne \sigma^2 I$, and the bias can vary
greatly with the choice of $A_i$. Heuristically, it makes sense
to choose the solution $A_i$ “closest” to the identity matrix, so
as to “mix” the residuals as little as possible. Two
promising candidates are the Cholesky decomposition of
$(I_i - H_{ii})^{-1}$, which has all 0's below the diagonal, and the
symmetric square root of $(I_i - H_{ii})^{-1}$. Let $P$ be an
orthogonal matrix whose columns are the eigenvectors of
$(I_i - H_{ii})^{-1}$ and $\Lambda$ be a diagonal matrix containing the
corresponding eigenvalues of $(I_i - H_{ii})^{-1}$, so that
$(I_i - H_{ii})^{-1} = P \Lambda P'$. Then for $\Lambda^{1/2}$ equal to the elementwise
square root of $\Lambda$, $P \Lambda^{1/2} P'$ is symmetric and solves (7). In
contrast, multiplying either of these two solutions by a
random orthogonal matrix could greatly distort the
residuals.

Among the class of adjusted residuals of the form $A_i r_i$
where $A_i$ satisfies (7), those based on the symmetric square
root of $(I_i - H_{ii})^{-1}$, $r_i^* = P \Lambda^{1/2} P' r_i$, are “best” in the sense
of Theil (1971), i.e., they minimize the expected sum of
the squared differences between the estimated and true i.i.d.
errors (see pages 36-37 of Bell and McCaffrey 2002 for
details). When there is intra-cluster correlation, simulation
results in section 6 suggest that the bias of $v^*$ based on the
symmetric square root is greatly reduced compared with
that of the traditional linearization estimator, $v_L$. For these
reasons, we consider only the symmetric root in the
remainder of the paper and refer to the estimator using this
root as the bias reduced linearization estimator, $v_{BRL}$.

As Kott (1994) proved for $v_L$, if the number of units in
every PSU is bounded and the elements of $(X'X)^{-1}X'$ are
bounded by $B/n$ for some constant $B$ (i.e.,
$(X'X)^{-1}X' = O(1/n)$), then the bias in $v_{BRL}$ is $O(n^{-2})$ and
the relative bias is $O(1/n)$ (Bell and McCaffrey 2002, page
15).
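
A minimal sketch of the bias reduced linearization estimator with working covariance $U = I$ follows. It obtains the symmetric square root of $(I_i - H_{ii})^{-1}$ from an eigendecomposition of $(I_i - H_{ii})$, which has the same eigenvectors and reciprocal eigenvalues. The function names `brl_adjustment` and `brl_variance`, the singularity tolerance, and the simulated data are assumptions made for illustration, not the authors' software.

```python
import numpy as np


def brl_adjustment(Hii):
    """Symmetric square root of inv(I_i - H_ii), i.e. the symmetric A_i solving (7)."""
    eigvals, P = np.linalg.eigh(np.eye(len(Hii)) - Hii)
    if eigvals.min() <= 1e-10:           # tolerance is an arbitrary choice
        raise ValueError("I_i - H_ii is singular; equation (7) has no solution")
    # inv(I_i - H_ii) has eigenvectors P and eigenvalues 1 / eigvals.
    return (P / np.sqrt(eigvals)) @ P.T


def brl_variance(X, y, psu, l):
    """BRL estimate of Var(l' beta_hat): equation (2) with c = 1 and A_i r_i."""
    XtX_inv = np.linalg.inv(X.T @ X)
    resid = y - X @ (XtX_inv @ X.T @ y)
    w = XtX_inv @ l
    total = 0.0
    for g in np.unique(psu):
        rows = psu == g
        Xi = X[rows]
        A = brl_adjustment(Xi @ XtX_inv @ Xi.T)
        total += (w @ Xi.T @ A @ resid[rows]) ** 2
    return total


# Hypothetical clustered data: 8 PSUs of 6 units each.
rng = np.random.default_rng(3)
n_psu, m = 8, 6
psu = np.repeat(np.arange(n_psu), m)
x = rng.normal(size=n_psu * m) + np.repeat(rng.normal(size=n_psu), m)
X = np.column_stack([np.ones(n_psu * m), x])
y = 0.5 + 1.5 * x + np.repeat(rng.normal(size=n_psu), m) + rng.normal(size=n_psu * m)
l = np.array([0.0, 1.0])

v_BRL = brl_variance(X, y, psu, l)

# Check condition (7) for the first PSU.
Xi0 = X[psu == 0]
H00 = Xi0 @ np.linalg.inv(X.T @ X) @ Xi0.T
A0 = brl_adjustment(H00)
print(v_BRL, np.allclose(A0.T @ A0, np.linalg.inv(np.eye(m) - H00)))
```

The explicit check on the smallest eigenvalue reflects the caveat above that (7) has no solution when an eigenvalue of $H_{ii}$ equals 1.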


4. VARIANCE OF THE ESTIMATORS AND 
TESTING 


We note that $v_L$, $v_{BRL}$, and $v_{JK}$ can all be written in the
form

$$
v^* = c\, l'(X'X)^{-1}\left(\sum_i X_i' A_i r_i r_i' A_i X_i\right)(X'X)^{-1}\, l,
$$

where $c = n/(n-1)$, $1$, or $(n-1)/n$, respectively, and
$A_i = I_i$, $(I_i - H_{ii})^{-1/2}$, or $(I_i - H_{ii})^{-1}$, respectively. This
formulation of the estimators shows that $v_{BRL}$ can be
viewed as a compromise between $v_L$ and $v_{JK}$, chosen to
offset their opposing biases.


Theorem 4. Let the error terms be distributed as
multivariate normal with mean 0 and nonsingular
covariance matrix $V$. Then for any variance estimator of the
form

$$
v^* = c\, l'(X'X)^{-1}\left(\sum_i X_i' A_i r_i r_i' A_i X_i\right)(X'X)^{-1}\, l,
$$

$v^*$ equals the weighted sum of independent $\chi_1^2$ random
variables, where the weights are the eigenvalues of the $n \times n$
matrix $G = \{g_i' V g_j\}$, for $g_i = c^{1/2}(I - H)_i' A_i X_i (X'X)^{-1} l$
(proof in appendix).

We can write $v^*$ as a quadratic form $y' G^* y$, where the
$M$-by-$M$ matrix $G^* = \sum_{i=1}^{n} g_i g_i'$, so that $v^*$ is a weighted
sum of independent chi-square random variables with
weights equal to the eigenvalues of $G^* V$. The proof
consists of showing that the nonzero eigenvalues of $G^* V$
equal the nonzero eigenvalues of $G$.

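The mean and variance of $v^*$ are simple functions of the
eigenvalues $\lambda_1, \ldots, \lambda_n$ of $G$, namely
$E(v^*) = \sum_{i=1}^{n} \lambda_i E(u_i) = \sum_{i=1}^{n} \lambda_i$ and
$\mathrm{Var}(v^*) = \sum_{i=1}^{n} \lambda_i^2\, \mathrm{Var}(u_i) = 2\sum_{i=1}^{n} \lambda_i^2$,
where the $u_i$ are the independent $\chi_1^2$ random variables of Theorem 4. If $V = \sigma^2 I$ and
$X_i'X_i(X'X)^{-1}l$ for $i = 1, \ldots, n$ are constant, conditions for
$v_L$ and $v_{JK}$ to be unbiased, then Theorem 4 implies that
$a v_L$, $a v_{JK}$, and $a v_{BRL}$ are all distributed $\chi_{n-1}^2$, for
$a = (n-1)/\mathrm{Var}(l'\hat\beta)$ (Bell and McCaffrey 2002, pages
41-42). However, in general, the $X_i'X_i(X'X)^{-1}l$ will not be
constant and the squared coefficient of variation will exceed
$2/(n-1)$, the corresponding statistic for a $\chi_{n-1}^2$ random
variable.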

This excess variability is of particular concern when
considering reference distributions for testing the null
hypothesis that $l'\beta = 0$, with test statistics of the form
$t = l'\hat{\beta}/\sqrt{v^*}$. For $v_L$, Shah, Holt and Folsom (1977)
suggested comparing $t$ to a reference $t$-distribution with
$n-1$ degrees of freedom, which is now the default in Stata
(Stata Corp. 1999), SUDAAN (Shah, Barnwell and Bieler
1997) and SAS (SAS Institute 1999). The choice of $n-1$
degrees of freedom is motivated by the fact that $v_L$ can be
written as the sum of squares of $n$ random variables
$c^{1/2} l'(X'X)^{-1}X_i'r_i$. However, because the variance of
$(n-1)v_L/E(v_L)$ tends to be greater than $2(n-1)$, tests that
use a $t$-distribution with $n-1$ degrees of freedom would
tend to have Type I error rates that exceed the nominal
value, even if $v_L$ were unbiased.

Satterthwaite (1946) suggested approximating the distribution
of a linear combination of $\chi^2_1$ variables by $\chi^2_f$ (up to
a constant) where the first two moments of the linear
combination match those of $\chi^2_f$. We would approximate
$v_L$, $v_{BRL}$ or $v_{JK}$ by a $\chi^2_f$, where $f = 2/cv^2 = (\sum_{i=1}^{n} \lambda_i)^2 / \sum_{i=1}^{n} \lambda_i^2$
and the $\lambda_i$ are the eigenvalues of the corresponding matrix
$G$. Tests based on reference $t$-distributions with $f$ degrees of
freedom would be expected to provide better Type I error
rates than tests based on $n-1$ degrees of freedom. Rust and
Rao (1996) also suggest using a Satterthwaite approximation
to estimate the degrees of freedom for the jackknife
estimator. They present results for the estimator of a mean,
while Theorem 4 extends this approach to testing linear
combinations of regression coefficients. Kott (1994, 1996)
suggests using the Satterthwaite approximation to estimate
the degrees of freedom for tests based on his alternatives to
linearization.
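The degrees-of-freedom formula can be illustrated with a short NumPy sketch that builds the vectors $g_i$ of Theorem 4, forms $G = \{g_i' V g_j\}$ and returns $f = (\sum \lambda_i)^2 / \sum \lambda_i^2$. The function name `satterthwaite_df` and its arguments are hypothetical; the sketch assumes unweighted OLS, positive definite $I_i - H_{ii}$, and a working $V$ equal to the identity matrix when none is supplied.

```python
import numpy as np

def satterthwaite_df(X, cluster, l, kind="BRL", V=None):
    """Satterthwaite degrees of freedom for v_L ("L"), v_BRL ("BRL") or v_JK ("JK")."""
    X, l = np.asarray(X, float), np.asarray(l, float)
    cluster = np.asarray(cluster)
    M = X.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    I_minus_H = np.eye(M) - X @ XtX_inv @ X.T
    ids = np.unique(cluster)
    n = len(ids)
    c = {"L": n / (n - 1), "BRL": 1.0, "JK": (n - 1) / n}[kind]
    power = {"L": 0.0, "BRL": 0.5, "JK": 1.0}[kind]
    w = XtX_inv @ l
    cols = []
    for g in ids:
        idx = np.flatnonzero(cluster == g)
        evals, Q = np.linalg.eigh(I_minus_H[np.ix_(idx, idx)])  # I_i - H_ii
        A = (Q * evals ** (-power)) @ Q.T                       # A_i
        # g_i = c^{1/2} (I - H)_i' A_i X_i (X'X)^{-1} l
        cols.append(np.sqrt(c) * I_minus_H[idx].T @ A @ X[idx] @ w)
    B = np.column_stack(cols)               # M x n matrix [g_1, ..., g_n]
    V = np.eye(M) if V is None else np.asarray(V, float)
    lam = np.linalg.eigvalsh(B.T @ V @ B)   # eigenvalues of G
    return lam.sum() ** 2 / (lam ** 2).sum(), lam
```

For a balanced design with only an intercept, the sketch returns $f = n - 1$ for all three estimators, in line with the $\chi^2_{n-1}$ result stated above.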

The coefficient of variation for any of the nonparametric 
variance estimators can be very large for certain designs. 
High variability occurs under the same conditions in which $v_L$
and $v_{JK}$ are most biased: when residuals from only a few
PSUs effectively determine the final variance estimate. This
variability of the estimators is an inherent cost of using 
nonparametric techniques. 

Because the Satterthwaite degrees of freedom f requires 
specifying the unknown matrix V, we have investigated two 
methods for setting V. The first treats V as block-diagonal 
and estimates each block with the outer-product of the 
residuals for the PSU. Because preliminary simulation 
results showed that degrees of freedom based on this 
empirical estimate of V produced tests that were extremely 
conservative, we do not present any simulation results for 
this method. Kott (1994) also found that estimating V for 
use in the formula for estimated degrees of freedom proved 
unsatisfactory. Instead, we used a second method that sets 
V identically equal to the identity matrix — i.e., it assumes 
independent, homoskedastic errors for purposes of deter- 
mining degrees of freedom. 

The distribution of $v_{BRL}$ (and the other variance
estimators) tends to be less skewed and have less mass in
the lower tail than the distribution of a $\chi^2_f$ where $f$ equals the
Satterthwaite degrees of freedom. Hence, reference
$t$-distributions based on the Satterthwaite approximation
tend to overestimate tail probabilities. For example, when
data from a couple of PSUs nearly determine the value of a
coefficient, the Satterthwaite degrees of freedom can be less
than two, incorrectly implying a chi-square density that is
infinite at zero. Consequently, the probability of very large
$t$-statistics may not be as large as the Satterthwaite
approximation would imply, especially when the Satterthwaite
degrees of freedom are less than 4 or 5.


5. SIMULATION METHODS 


We use a Monte Carlo simulation to study the properties 
of alternative variance estimators and tests for a balanced 
two-stage cluster sample with n = 20 PSUs and a constant
m = 10 observations in each PSU. All simulation repli- 
cations use a common design matrix X with four explana- 
tory variables chosen to represent a range of difficulty for 
nonparametric variance estimators. The first two 
explanatory variables, $x_1$ and $x_2$, are dichotomous (0 or 1)
and constant within PSU. The variable $x_1$ is 1 in half the
clusters: 1, 3,...,19, while $x_2$ is 1 in just three clusters: 9, 10,
and 11. Both $x_3$ and $x_4$ were generated from standard
normal distributions. They differ in that $x_3$ was generated
from a multivariate normal with intra-cluster correlation of
0.5 within PSU, while $x_4$ was generated from independent
normal distributions. Observed intra-cluster correlations are 
1.00, 1.00, 0.62 and -0.04, respectively. Observed 
correlations among the explanatory variables are all very 
small, with the exception of three pairwise correlations of 0.14,
0.25 and -0.11. The estimated
regression coefficients are linear combinations of the 
outcome variable with multipliers given by the rows of
$(X'X)^{-1}X'$, which are shown in Figure 1. For the first three
coefficients, and to a lesser extent $\hat{\beta}_3$, observations from the
same PSU tend to have similar multipliers. Of more
importance, three of the coefficients are determined primarily by
results in a small number of PSUs with relatively large
multipliers (in absolute value). For example, Figure 1 shows
that the multipliers for $\hat{\beta}_3$ are large for the second PSU,
which has a mean that is over two standard deviations from
the average PSU mean. In general, variance in the PSU
means gives some PSUs greater weight for estimating $\hat{\beta}_3$.

The outcome variable was generated from the equation
$y_{ij} = \beta' x_{ij} + \varepsilon_{ij}$, where $\beta = 0$ and the $\varepsilon_{ij}$'s are standard
normal random variables with intra-cluster
correlation $\rho$. We use three alternative values of $\rho$ = 0, 1/9,
and 1/3, corresponding to design effects for the sample
mean of DEFF = 1, 2, and 4, respectively
(DEFF = $1 + (m-1)\rho$). Monte Carlo results are based on
100,000 replications of $y$ for our fixed $X$.
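One simple way to draw errors of this kind is a shared PSU effect plus an independent unit-level term. The sketch below is an assumption about how such errors could be generated, not a description of the authors' SAS code; the function names `simulate_errors` and `design_effect` are hypothetical.

```python
import numpy as np

def simulate_errors(n_psu, m, rho, rng):
    """Standard normal errors with exchangeable intra-cluster correlation rho (0 <= rho < 1)."""
    shared = rng.standard_normal((n_psu, 1))   # one draw per PSU
    unique = rng.standard_normal((n_psu, m))   # one draw per observation
    # Var = rho + (1 - rho) = 1; within-PSU covariance = rho
    return np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * unique

def design_effect(m, rho):
    """DEFF for the sample mean with constant PSU size m: 1 + (m - 1) * rho."""
    return 1.0 + (m - 1) * rho

rng = np.random.default_rng(2002)
errors = simulate_errors(n_psu=20, m=10, rho=1 / 9, rng=rng)   # one replication, 20 x 10
```

With m = 10, `design_effect` gives 1, 2 and 4 for $\rho$ = 0, 1/9 and 1/3, matching the values quoted in the text.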

We evaluated the ordinary least squares (OLS) variance
estimator, $s^2 l'(X'X)^{-1} l$, and five nonparametric variance
estimators: the standard linearization estimator given in
equation (2) with $c = n/(n-1)$; the jackknife estimator
given in (5); bias reduced linearization; and Kott's two
adjustments to linearization. BRL and the Kott adjustments
are all based on working intra-cluster correlations of $\rho = 0$.
We estimated Type I error rates for eight alternative test
procedures based on 100,000 replications from the null
hypothesis where each $\beta_k = 0$, for $k$ = 0 to 4. Each procedure
compares a "t-statistic" against a reference t-distribution.
For the t's based on linearization, the jackknife, and
BRL, we use critical values from t-distributions with both
$(n-1) = 19$ degrees of freedom and the corresponding
Satterthwaite approximation. For Kott's methods, we use
his proposed degrees of freedom. All computations were
implemented in SAS.


6. SIMULATION RESULTS 


Table 1 shows the bias of several variance estimators for
the five regression coefficients (including the intercept) for
$\rho$ = 0, 1/9, and 1/3. Except for Kott (1994), all values are
exact based on the X matrix described above. Because Kott 
(1994) cannot be written as a linear functional, its bias is 
estimated from the Monte Carlo simulations, and the 
standard error of the bias is shown in parentheses. 


Table 1
Bias of Variance Estimators (as a Percentage of the True Variance)

Estimator           β0           β1           β2           β3           β4
ρ = 0
OLS                 0.0          0.0          0.0          0.0          0.0
Linearization       [illegible]  [illegible]  [illegible]  [illegible]  -1.8
Jackknife           11.7         17.2         [illegible]  17.6         [illegible]
Kott (1994)         4.0          [illegible]  -1.0         [illegible]  4.7
(Standard error)    [illegible]  [illegible]  (0.3)        (0.2)        [illegible]
Kott (1996)         0.0          0.0          0.0          0.0          0.0
BRL                 0.0          0.0          0.0          0.0          0.0
ρ = 1/9
OLS                 -50.2        -49.7        [illegible]  -37.7        4.1
Linearization       [illegible]  [illegible]  [illegible]  [illegible]  -2.5
Jackknife           11.0         16.4         50.1         19.8         3.2
Kott (1994)         3.9          [illegible]  -0.8         [illegible]  4.6
(Standard error)    [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
Kott (1996)         -0.8         -1.2         -1.0         -4.4         -0.7
BRL                 -0.7         -1.0         -0.8         -1.2         0.1
ρ = 1/3
OLS                 [illegible]  [illegible]  [illegible]  [illegible]  13.8
Linearization       -10.7        -14.8        -33.5        -19.9        -4.1
Jackknife           10.7         15.9         49.5         21.4         5.9
Kott (1994)         3.6          2.4          -0.6         1.4          4.4
(Standard error)    [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
Kott (1996)         -1.2         -1.9         -1.5         -7.7         -2.3
BRL                 -1.0         -1.5         -1.3         -2.1         0.4


Note: All values are exact except for Kott (1994), which is based on
100,000 simulation replications.


The OLS variances are unbiased for $\rho = 0$, but they are
badly biased for $\rho$ = 1/9 and 1/3. As discussed in Wu, Holt
and Holmes (1988), the OLS variances are too small by
roughly a factor of $1/[1 + \rho(m-1)\mathrm{ICC}_x]$, where $\mathrm{ICC}_x$
denotes the intra-cluster correlation for an $x$ variable.
Hence, for PSU-level variables (including the intercept), the
OLS variances are too small by roughly a factor of 1/DEFF.
Similarly, the bias is smaller, but still substantial, for $x_3$, the
individual-level variable with large intra-cluster correlation.
The positive bias for the OLS variance of $\hat{\beta}_4$ results from
the slight negative intra-cluster correlation for $x_4$.
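The approximation can be checked with simple arithmetic, using the observed intra-cluster correlations reported in section 5 (1.00, 1.00, 0.62 and -0.04) and m = 10. The helper name `ols_variance_relative_bias` is hypothetical, and the formula is only the rough factor quoted above, not an exact bias.

```python
def ols_variance_relative_bias(rho, m, icc_x):
    """Approximate relative bias of the OLS variance: 1/[1 + rho*(m-1)*ICC_x] - 1."""
    return 1.0 / (1.0 + rho * (m - 1) * icc_x) - 1.0

m = 10
observed_icc = {"x1": 1.00, "x2": 1.00, "x3": 0.62, "x4": -0.04}
approx_bias_pct = {
    label: {name: round(100 * ols_variance_relative_bias(rho, m, icc), 1)
            for name, icc in observed_icc.items()}
    for label, rho in (("1/9", 1 / 9), ("1/3", 1 / 3))
}
```

For $\rho$ = 1/9 the approximation gives about -50 percent for the PSU-level variables, about -38 percent for $x_3$ and about +4 percent for $x_4$, which can be compared with the OLS rows of Table 1.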

Linearization and the jackknife each suffer from large
biases, relatively independent of $\rho$, but the biases point in
opposite directions. For each estimator, the magnitude of
the bias varies greatly among the coefficients. The largest
biases (in absolute value) occur for $\hat{\beta}_2$, which depends
mainly on the data from three PSUs. The next greatest
biases occur for $\hat{\beta}_3$, followed closely by $\hat{\beta}_1$ and $\hat{\beta}_0$.

Except for $\hat{\beta}_4$, Kott (1994) has much smaller magnitude
bias than linearization. However, the method tends to
overcompensate, often resulting in notable positive bias. An
exception is $\hat{\beta}_2$, for which Kott's estimator remains biased
low.

By design, Kott (1996) and BRL eliminate the bias for
$\rho = 0$. Consequently, choice among these alternatives
should rest mainly on how well they hold down bias for
$V \neq I$. Both methods reduce the magnitude of bias
dramatically relative to linearization for $\rho$ = 1/9 and 1/3.
Although differences between the two methods are often
small, BRL does uniformly better, with its worst bias being
-2.1 percent. While Kott (1996) is practically
indistinguishable from BRL for the PSU-level variables, it
performs substantially worse for $\hat{\beta}_3$ and $\hat{\beta}_4$.

The linearization, jackknife, BRL and Kott estimators 
are highly correlated with similar coefficients of variation. 
For any given regression coefficient, the correlation among 
the variance estimators always exceeded 0.969, with most 
exceeding 0.99 (not shown). The smallest correlations 
tended to be between the jackknife and other estimators. 
The coefficients of variation (also not shown) were largest 
for Kott (1994) and tended to be smallest for linearization 
and Kott (1996) (except for the intercept). For the intercept, 
the jackknife had the smallest coefficient of variation. The 
relative variance of the BRL estimator was similar to that of 
the alternative nonparametric methods. Its coefficient of 
variation was between 1 and 6 percent larger than that of 
the linearization estimator but about 5 to 10 percent smaller 
than that of Kott (1994). Thus, the five nonparametric 
variance estimators tend to differ from each other mainly by 
constant factors, and Table 1 summarizes the main 
difference among these variance estimators. 

Table 2 shows the Satterthwaite degrees of freedom for 
each of the five coefficients for the linearization, jackknife, 
BRL and Kott variance estimators. For all estimators the 
degrees of freedom were calculated assuming V =I and 
consequently depend only on the design matrix and not on 
the values of y. The approximations are similar for linear- 
ization and BRL although the linearization degrees of 
freedom tend to be slightly larger, reflecting the fact that for
this design matrix the relative variances of the BRL estimators
are marginally larger than those for linearization.
Kott’s approximation derives the coefficient of variation for 
a linearization-type estimator based on the true errors rather 
than the residuals. As a result, Kott’s approximate degrees 
of freedom, which are larger than those for linearization or 
BRL, tend to overstate the precision of his estimator (see 
Kott 1994, section 6). Across all four estimators, the
approximations are smallest for $\hat{\beta}_2$.


Table 2
Degrees-of-Freedom for Selected Estimators

Method                      β0           β1           β2           β3           β4
Satterthwaite (LIN)         9.02         14.45        3.30         11.56        16.65
Satterthwaite (Jackknife)   9.52         [illegible]  [illegible]  [illegible]  [illegible]
Satterthwaite (BRL)         9.24         14.08        2.90         10.26        16.45
Kott's method               [illegible]  16.41        [illegible]  [illegible]  17.44


Table 3 shows that Type I error rates for the standard
linearization method with $(n-1)$ degrees of freedom
consistently exceed 5 percent for all three values of $\rho$. Type I
errors are most common for $\hat{\beta}_2$, where they reach as high as
16 percent, but they also occur much too frequently for
$\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_3$, ranging from 7.0 to 8.8 percent. The
magnitude of this problem correlates closely with the size of the
bias of the linearization estimator (see Table 1). Type I
error rates are much lower, 5.7 to 6.4 percent, for tests
based on the Satterthwaite degrees of freedom. Thus using
the alternative degrees of freedom improved the Type I
error rates by about 30 to 88 percent.

There is a less consistent pattern for the Type I error
probabilities for the jackknife. The jackknife with $(n-1)$
degrees of freedom tends to be conservative for two of the
slope coefficients, in accord with the positive bias in the
jackknife variance. In contrast, the probability of Type I error
is much too large for $\hat{\beta}_2$, and a bit too large in two of three
cases for the intercept $\hat{\beta}_0$. The apparent explanation is that
the choice of $(n-1)$ as the degrees of freedom for the reference
t-distribution sometimes counteracts the bias in the
jackknife variance. This conclusion is supported by the
very low Type I error rates for the jackknife with
Satterthwaite degrees of freedom; smaller degrees of
freedom combined with large positive biases result in very
conservative tests.

BRL with $(n-1)$ degrees of freedom improves substantially
on linearization with the same degrees of freedom.
Because BRL is unbiased when $\rho = 0$, comparing the fifth
row of the table against the first demonstrates the reduction
in Type I errors that results from removing the bias of
linearization. Excluding $\hat{\beta}_4$, BRL reduces Type I error rates
by about 45 to 88 percent. However, BRL with $(n-1)$
degrees of freedom remains consistently liberal, especially
for $\hat{\beta}_2$. Comparison of rows 2 and 5 of each section shows
the relative impact of bias reduction and the Satterthwaite
adjustment. For $\hat{\beta}_0$ and one other coefficient, degrees of
freedom are more important, while bias matters more for two
of the others. Performance for BRL with the Satterthwaite
approximation is very good, except for $\hat{\beta}_2$, where the Type I
error falls to about 3 percent.


Table 3
Type I Error Rates for Tests of the Null Hypothesis that β = 0

Estimator       Df    β0           β1           β2           β3           β4
ρ = 0
Linearization   n-1   [illegible]  [illegible]  [illegible]  [illegible]  5.38
Linearization   Satt  5.75         6.45         6.33         6.28         5.18
Jackknife       n-1   5.01         [illegible]  [illegible]  [illegible]  [illegible]
Jackknife       Satt  3.80         3.43         1.41         3.26         4.77
Kott (1994)     Kott  4.87         5.03         [illegible]  5.21         4.67
Kott (1996)     Kott  5.11         [illegible]  4.85         [illegible]  [illegible]
BRL             n-1   [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
BRL             Satt  [illegible]  [illegible]  3.12         [illegible]  [illegible]
ρ = 1/9
Linearization   n-1   [illegible]  [illegible]  [illegible]  [illegible]  5.34
Linearization   Satt  6.03         6.60         6.43         7.05         5.14
Jackknife       n-1   5.31         4.06         [illegible]  [illegible]  [illegible]
Jackknife       Satt  4.11         3.61         [illegible]  [illegible]  [illegible]
Kott (1994)     Kott  [illegible]  [illegible]  [illegible]  [illegible]  4.56
Kott (1996)     Kott  [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
BRL             n-1   [illegible]  [illegible]  [illegible]  [illegible]  5.08
BRL             Satt  5.04         5.00         3.19         4.93         4.84
ρ = 1/3
Linearization   n-1   8.10         7.28         [illegible]  8.79         5.66
Linearization   Satt  6.30         6.78         6.62         7.53         5.44
Jackknife       n-1   5.45         4.11         7.76         4.56         4.67
Jackknife       Satt  [illegible]  3.61         1.51         3.35         4.46
Kott (1994)     Kott  5.14         5.06         7.02         5.80         4.84
Kott (1996)     Kott  [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
BRL             n-1   6.16         [illegible]  [illegible]  6.45         [illegible]
BRL             Satt  [illegible]  [illegible]  [illegible]  [illegible]  [illegible]


Note: Entries with a true value of 5.00 percent have standard errors
of 0.07 percent.


Tests based on Kott's 1994 estimator with his proposed
degrees of freedom perform very well for the coefficients
where the variance estimator is biased upward. It appears
that the upward bias in the variance estimator is offset by
the upward bias in the approximate degrees of freedom.
Kott's variance estimator is slightly negatively biased for
$\hat{\beta}_2$, and therefore the upward bias in the degrees of freedom
compounds the bias in the estimator, resulting in a Type I
error rate of about 7 percent for all three values of $\rho$.

Tests based on Kott's 1996 estimator also perform well.
For almost all the coefficients and all values of $\rho$ the Type
I error rate is close to 5 percent. The exception is the test for
$\hat{\beta}_3$ when $\rho$ = 1/3, which has an error rate of 5.88 percent as
a result of the moderate bias in the variance estimator.


7. EXAMPLE FROM THE PARTNERS IN CARE
EXPERIMENT


We illustrate the methods in this paper using data from 
Partners in Care, a longitudinal experiment assessing the 
effect of “quality improvement” programs on care for 
depression in managed care organizations (MCOs) (Wells 
et al. 2000). The experiment followed 1356 patients who 
screened positive for depression in 1996-1997 in 43 clinics 
of seven MCOs. Clinics were assigned at random to one of 
three experimental cells: usual care, a quality improvement 
program supplemented by resources for medication 
follow-up, or a quality improvement program supplemented 
by resources for access to psychotherapists. Clinics were 
assigned at random after forming 27 clinic sets—three for 
each of nine blocks (six MCOs constituted single blocks, 
and one MCO was divided into three blocks based on 
ethnic mix of the clinics). Within blocks of more than three
clinics, clinic sets were combined to match as closely as
possible on anticipated sample size and patient character- 
istics. See Wells et al. (2000) for additional details. 

We present results from an OLS regression on the mental 
health summary score from the SF-12 (Ware, Kosinski and 
Keller 1995) for 1048 patients at 6-month follow-up. 
Scores were standardized to have mean 50 and standard 
deviation 10 in a general population, with higher scores 
indicating better health. As in Wells et al. (2000), the 
explanatory variable of primary interest is an intervention 
indicator that estimates the combined effect of medication 
or therapy versus care as usual. The first two columns of 
Table 4 show OLS coefficients and standard errors for the 
intervention effect and all the covariates used by, but not 
reported in, Wells et al. (2000). Our regression differs from 
theirs because we do not weight for nonresponse or impute 
for missing values of the outcome variable, but the results 
for the intervention effect agree reasonably closely. 


Table 4
Comparison of OLS, Linearization, and BRL Inference for Partners in Care

Explanatory Variable          Coef.     SE_OLS       SE_LIN/SE_OLS  SE_BRL/SE_OLS  DF           P_OLS   P_LIN   P_BRL
PSU-Level
Intercept                     28.795    3.409        1.03           1.06           23.1         0.000   0.000   0.000
Intervention                  1.724     0.746        0.73           0.84           15.4         0.021   0.003   0.015
Block 1                       1.386     1.867        0.63           0.80           2.7          0.458   0.244   0.426
Block 2                       -0.031    1.576        0.88           1.07           3.6          0.984   0.982   0.986
Block 3                       -1.042    1.230        0.53           0.61           3.9          0.397   0.117   0.241
Block 4                       0.038     1.234        0.62           0.73           4.5          0.976   0.961   0.968
Block 5                       -3.707    1.503        0.66           0.78           4.7          0.014   0.001   0.027
Block 6                       -0.025    1.562        [illegible]    [illegible]    4.9          0.987   0.989   0.991
Block 7                       -2.784    1.644        0.84           0.97           7.0          0.090   0.051   0.126
Block 8                       0.822     1.233        0.93           1.03           12.0         0.505   0.476   0.527
Demographic
Black                         0.972     1.448        0.74           0.79           7.6          0.502   0.369   0.419
Hispanic                      0.202     1.004        0.73           0.75           24.3         0.841   0.785   0.791
Other nonwhite                -1.033    1.409        0.77           0.80           21.6         0.463   0.349   0.369
Female                        -0.502    0.803        1.09           [illegible]    23.8         0.532   0.571   0.581
Log of net worth + $1,000     0.015     0.215        0.87           0.89           23.6         0.943   0.936   0.937
Less than high school         -1.690    1.213        1.00           1.04           25.2         0.165   0.173   0.192
Some college                  -1.140    0.879        0.77           0.78           26.0         0.195   0.097   0.108
College graduate              -0.703    1.047        0.78           0.79           21.1         0.502   0.393   0.404
Age                           0.059     0.032        0.91           0.93           26.5         0.064   0.047   0.056
Married                       0.541     0.748        1.05           1.07           28.5         0.470   0.496   0.504
Baseline Health
1 chronic condition (of 19)   -0.973    1.039        0.92           0.94           [illegible]  0.349   0.313   0.327
2 chronic conditions          0.198     1.116        0.87           0.90           23.0         0.859   0.840   0.846
3+ chronic conditions         -0.201    1.132        0.90           0.91           24.0         0.859   0.844   0.847
Depression and dysthymia      -5.305    [illegible]  0.93           0.95           25.8         0.000   0.000   0.000
Depression or dysthymia       -3.882    0.982        [illegible]    [illegible]    [illegible]  0.000   0.001   0.002
Prior depression only         -2.396    1.109        1.02           1.05           [illegible]  0.031   0.040   0.052
Mental component of SF-12     0.287     0.036        [illegible]    1.14           26.6         0.000   0.000   0.000
Physical comp of SF-12        0.079     0.036        0.88           0.89           24.6         0.029   0.017   0.022
Anxiety disorder              -2.438    0.749        1.20           1.23           26.3         0.001   0.010   0.014


Because patients from the same clinics could have
similar outcomes, OLS standard errors could easily be too
low, especially for PSU-level variables like Intervention.
Columns 3 and 4 of Table 4 show the ratios of linearization
and BRL standard errors to the OLS standard errors. We
use clinic as the PSU because there is very little reason to
expect correlations of errors across clinics after controlling
for block.

Using the method of Wu, Holt and Holmes (1988), we 
estimate the intra-clinic correlation of the errors as -0.0026, 
easily consistent with a true value of 0. Nonetheless, there 
is no reason to expect any of the correct standard errors to 
fall much below those obtained from OLS. Column 3 of 
Table 4 shows that the linearization standard errors 
frequently fall far below those obtained from OLS — 
especially for the PSU-level explanatory variables at the 
top of the table. Similarly, linearization with a reference
$t_{n-1}$ often produces much smaller P-values than does OLS.
BRL improves over linearization. BRL standard errors are 
always larger and sometimes substantially larger than the 
linearization standard errors. For example, the BRL 
estimates for PSU-level explanatory variables are on 
average 15 percent larger than the linearization estimates. 
On the other hand, BRL standard errors for PSU-level
variables are still often smaller than the OLS estimates. 
Thus, even though BRL estimators should be nearly 
unbiased, the variability in the estimators results in esti- 
mates for some coefficients that are small. The variability 
is also reflected in degrees of freedom that are very small 
for the block indicators and, while larger for patient level 
variables, are still considerably less than 42, the number of 
clusters minus one. The degrees of freedom are especially 
small, 7.6, for the indicator variable Black (equal to one if
the patient was African American and zero otherwise). 
Plots analogous to Figure 1 show that Black was con- 
centrated in three clusters. The Black indicator equals zero 
for all the patients in 24 of 43 clusters, and 48 of the 78 
African Americans in the sample were found in just three 
clusters. As discussed in sections 2 and 4, the concentration 
of Black into a small number of clusters results in high 
variance for both estimators and large bias in the linear- 
ization estimator, both of which can be seen in Table 4. 


8. DISCUSSION 


Although linearization is a valuable tool that provides 
consistent standard errors and valid inference as the number 
of PSUs grows large in multi-stage samples, users should 
recognize problems with the method. Estimated variances 
of linear regression coefficients (including domain means) 
tend to be biased low — especially for coefficients (or linear 
combinations of coefficients) that depend largely on data 
from a small number of PSUs. Depending on the design, 
large biases can persist even when the total number of PSUs 
is quite large. The standard jackknife for multi-stage
samples tends to have at least as large a bias in the opposite
direction. Similarly, using a reference t distribution with 
degrees of freedom equal to one less than the number of 
PSUs may greatly understate the uncertainty in the 
estimated variance. Because the two problems (bias and 
overstated degrees of freedom) tend to occur in tandem for 
linearization, confidence intervals and statistical tests based 
on that method may be far too liberal. 

Bias reduced linearization (BRL) produces unbiased 
variance estimates in the event that errors are homo- 
skedastic and uncorrelated, and it tends to greatly reduce 
bias for other covariance structures investigated in our 
simulations. In our simulations, BRL consistently exhibited 
smaller biases than linearization by 90 percent or more and 
tended to improve substantially on Kott’s 1994 adjusted 
linearization method. Results for BRL were comparable to 
those for Kott’s 1996 method. 

When BRL was used with the estimated Satterthwaite 
degrees of freedom, statistical inference improved greatly 
in comparison with the standard use of linearization. Bias 
reduction and Satterthwaite degrees of freedom seemed to 
contribute about equally to the improved performance. 
Although Satterthwaite’s approximation may overcom- 
pensate, leading to conservative inference in certain situa- 
tions, the problem does not seem noteworthy until the 
Satterthwaite degrees of freedom drop below 5 (based, in 
part, on simulations not reported in this paper). In such 
cases, analysts might choose to estimate critical values 
using simulations based on Theorem 4. 
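A minimal sketch of such a simulation follows. It assumes a working covariance $V = I$, under which $l'\hat{\beta}$ is independent of the residuals, so that the test statistic has the same distribution as $\sqrt{l'(X'X)^{-1}l}\, z / \sqrt{\sum_i \lambda_i u_i^2}$ with $z$ and the $u_i$ independent standard normal. The function name `simulated_critical_value` and its arguments are hypothetical: `lam` holds the eigenvalues of $G$ and `var_lb` is $l'(X'X)^{-1}l$.

```python
import numpy as np

def simulated_critical_value(lam, var_lb, alpha=0.05, n_draws=200_000, seed=0):
    """Two-sided critical value for t = l'beta_hat / sqrt(v*), simulated under V = I."""
    rng = np.random.default_rng(seed)
    lam = np.asarray(lam, float)
    lam = lam[lam > 1e-12 * lam.max()]            # drop numerically zero eigenvalues
    z = rng.standard_normal(n_draws)              # standardized numerator
    u2 = rng.chisquare(1, size=(n_draws, lam.size))
    v = u2 @ lam                                  # draws of v* = sum_i lam_i * u_i^2
    t = np.sqrt(var_lb) * z / np.sqrt(v)
    return np.quantile(np.abs(t), 1.0 - alpha)
```

When all weights are equal the simulated value reproduces the usual $t$ critical value; with a few dominant weights it can be much larger.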

It is important to note some limitations of our simulation 
results. First, we only report results for four distinct
explanatory variables plus an intercept. We chose those
variables to span a wide variety of situations. Although
some might describe $x_2$ as extreme or pathological, it is not
outside the range of situations that we have seen in our own
consulting work. Variables like $x_2$ can result from
group-randomized trials (see section 7) or observational data
where only a few PSUs exhibit a particular trait or from use 
of a series of dummy variables to represent levels of a 
categorical variable. Second, we present results only for 
n=20 PSUs. To the extent that X remains similar as n 
increases (e.g., by replication), Equation (4) implies that the 
bias declines in proportion to 1/(n-1). Also, the results 
observed for n = 20 could occur for much larger n if the 
bulk of the variation in $X$ is contributed by a few PSUs, and
the determination of $l'\hat{\beta}$ depends similarly on a small
number of PSUs. Finally, to reduce the number of factors
affecting the results, we simplified the design in several 
ways: constant PSU sizes, no weights or strata, and little 
multicollinearity. We suspect that relaxing any of those 
constraints would actually tend to make standard lineari- 
zation and the jackknife perform worse. We do not believe 
that the choice of m = 10 for the PSU size had much impact 
either way on our findings. 

Although we believe that our proposed methods will
prove valuable to analysts of multi-stage samples, these
methods will not completely solve the inference problem
for unweighted linear regression. Both authors have
frequently observed the disturbing situation where standard
linearization methods produced shorter confidence intervals
than methods that ignore the design. Certainly, the bias of $v_L$
and improper use of $n-1$ degrees of freedom contribute to
the frequency of this phenomenon, but our methods would
not eliminate its occurrence (see section 7). Linearization,
like sample reuse methods, necessarily produces estimators
with high variance for some or possibly all coefficients in
certain designs. When confronted with situations like the
coefficients for our $x_2$, where the Satterthwaite degrees of
freedom fall near 3 or lower, analysts should seriously
consider whether they can afford the large variability, and
corresponding loss of power, that comes with nonparametric
variance estimators. Parametric alternatives like
hierarchical linear models or inference based on estimating
a common intra-class correlation across all the PSUs (Wu,
Holt, and Holmes 1988) should produce more stable results.

Although this paper has focused on unweighted linear 
regression for samples without stratification, we have no 
reason to expect that the bias and degrees-of-freedom 
problems of linearization would be lessened by stratifica- 
tion or for either weighted least squares or generalized 
linear models (GLMs). As shown in McCaffrey, Bell and
Botts (2001), the BRL method extends immediately to
weighted linear regression by using
$H = X(X'WX)^{-1}X'W$ in the main condition of Theorem 3.
Because solutions to GLMs, such as logistic regression, are
equivalent to the final steps of iteratively reweighted least
squares (McCullagh and Nelder 1989), the obvious choice
for these models is to use BRL based on the final weights
and to set $U = W^{-1}$. Nevertheless, Theorem 3 does not
extend to GLMs because the weights are estimated from the
data, and we have not investigated the properties of BRL in 
this context. 
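For illustration, the weighted hat matrix $H = X(X'WX)^{-1}X'W$ mentioned above can be formed as follows. This is a sketch assuming a diagonal $W$ supplied as a vector of positive weights; the function name `weighted_hat` is hypothetical, and the sketch does not implement the BRL adjustment itself.

```python
import numpy as np

def weighted_hat(X, w):
    """H = X (X'WX)^{-1} X'W for W = diag(w), with w a vector of positive weights."""
    X, w = np.asarray(X, float), np.asarray(w, float)
    XtW = X.T * w                                # X'W: column j of X' scaled by w_j
    return X @ np.linalg.solve(XtW @ X, XtW)
```

With unit weights this reduces to the ordinary hat matrix $X(X'X)^{-1}X'$; in general $H$ is idempotent but not symmetric.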

Korn and Graubard (1995) suggest $v_L$ as a standard
error estimator for stratified samples in situations where the
stratification is non-informative. The same reasoning
applies to $v_{BRL}$. Fuller (1975) proposed an alternative
design consistent standard error estimator for stratified
samples. Bell and McCaffrey (2002, pages 32-33) show that
by adjusting the vector of residuals for each stratum, BRL
can reduce or remove the model bias that can exist in
Fuller's estimator.


APPENDIX


Proofs of Theorems 2 and 4 


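Proof of Theorem 2

Following the first steps of the proof of Theorem 1,
equation (6) implies that

$$E(v_{JK}) = \left(\frac{n-1}{n}\right) l'(X'X)^{-1} \left[ \sum_{i=1}^{n} X_i'(I_i - H_{ii})^{-1} X_i \right] (X'X)^{-1} l.$$

The existence of $(I_i - H_{ii})^{-1}$ implies that the eigenvalues of
$H_{ii}$ are strictly less than 1, so that $(I_i - H_{ii})^{-1}$ can be written
as $\sum_{r=0}^{\infty} H_{ii}^r$. Consequently, letting $D = (1/n)(X'X)$ and
$D_i = (X_i'X_i) - D$, we have

$$E(v_{JK}) = \left(\frac{n-1}{n}\right) l'(X'X)^{-1} \sum_{i=1}^{n} \sum_{k=1}^{\infty} \left[ X_i'X_i (X'X)^{-1} \right]^{k} l
= \left(\frac{n-1}{n}\right) l'(X'X)^{-1} \sum_{i=1}^{n} \sum_{k=1}^{\infty} \left[ \frac{1}{n} I + D_i (X'X)^{-1} \right]^{k} l,$$

which can be expanded in powers $[D_i(X'X)^{-1}]^r$, $r = 0, 1, 2, \ldots$
The term for $r = 0$ equals $l'(X'X)^{-1}l = \mathrm{Var}(l'\hat{\beta})$. The
term for $r = 1$ equals 0. By the binomial theorem,

$$\sum_{s=0}^{\infty} \binom{r+s}{r} \left(\frac{1}{n}\right)^{s} = \left(\frac{n}{n-1}\right)^{r+1},$$

so that the remaining terms can be paired, for $r = 2, 4, 6, \ldots$,
to give

$$\left(\frac{n}{n-1}\right)^{r} \sum_{i=1}^{n} l' \left[ (X'X)^{-1} D_i \right]^{r/2} \left\{ (X'X)^{-1} + \left(\frac{n}{n-1}\right) (X'X)^{-1} D_i (X'X)^{-1} \right\} \left[ D_i (X'X)^{-1} \right]^{r/2} l.$$

The middle factor in the summation can be written as

$$\left(\frac{n-2}{n-1}\right) (X'X)^{-1} + \left(\frac{n}{n-1}\right) (X'X)^{-1} (X_i'X_i) (X'X)^{-1},$$
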

180 Bell and McCaffrey: Bias Reduction in Standard Errors for Linear Regression with Multi-Stage Samples 


which is positive definite, so that the whole expression must
be positive. Consequently, we have shown that
$E(v_{JK}) \geq \mathrm{Var}(l'\hat{\beta})$ with equality if and only if
$l'(X'X)^{-1} D_i = 0$ for every $i$, which is true if and only if
$l'(X'X)^{-1} X_i'X_i$ is constant across $i$.


Proof of Theorem 4

$$v^* = c \sum_{i=1}^{n} l'(X'X)^{-1} X_i' A_i (I-H)_i\, \varepsilon \varepsilon' (I-H)_i' A_i X_i (X'X)^{-1} l.$$

Let $P$ equal the matrix of eigenvectors and $\Lambda$ denote the
diagonal matrix with elements $\lambda_1, \ldots, \lambda_M$ equal to the
eigenvalues of $V^{1/2} G^* V^{1/2} = B'B$, where $B' = V^{1/2} [g_1, g_2, \ldots, g_n]$.
Let $u = P'V^{-1/2}y$, where $V^{-1/2} V V^{-1/2} = I$
defines $V^{-1/2}$; then the elements of $u$ are independent normal
variables with variance 1 and

$$v^* = u'\Lambda u = \sum_{i=1}^{M} \lambda_i u_i^2.$$

Let $\lambda_k$ be any nonzero eigenvalue of $B'B$; then there
exists a nonzero vector $z$ such that $B'Bz = \lambda_k z$ and
$BB'Bz = \lambda_k Bz$. Because $Bz \neq 0$, $\lambda_k$ is an eigenvalue of
$BB'$. Similarly, any nonzero eigenvalue of $BB'$ is also
an eigenvalue of $B'B$. Therefore, the nonzero eigenvalues
of $B'B$ equal the nonzero eigenvalues of $BB' = \{g_i' V g_j\}$.


Design Effects of Sampling Frames in Establishments Survey


MONROE G. SIRKEN’ 


ABSTRACT 


When stand-alone sampling frames that list all establishments and their measures of size are available, establishment surveys 
typically use the Hansen-Hurwitz (HH) pps estimator to estimate the volume of transactions that establishments have with 
populations. This paper proposes the network sampling (NS) version of the HH estimator as a potential competitor of the 
HH estimator. The NS estimator depends on the population survey-generated establishment frame that lists households and 
their selection probabilities in a population sample survey, and the number of transactions, if any, of each household with 
each establishment. A statistical model is developed in this paper to compare the efficiencies of the HH and NS estimators 
in single-stage and two-stage establishment sample surveys assuming the stand-alone sampling frame and the population 
survey-generated frame are flawless in coverage and size measures. 


KEY WORDS: Stand-alone establishment frames; Population survey-generated establishment frames; Hansen-Hurwitz 
estimator; Network sampling estimator. 


1. INTRODUCTION 


Listings of establishments that have transactions with 
households in population sample surveys serve as sampling 
frames of establishment surveys whenever the transactions 
reported by households in the population surveys are 
matched with the records of their establishments. For 
example, the listings of establishments that have trans- 
actions with households in the National Medical 
Expenditure Panel Survey (MEPS), a national population 
sample survey, serve as sampling frames for medical 
provider surveys that supplement and verify the medical 
expenditures of the transactions reported by MEPS house- 
hold respondents (Cohen 1998). However, listings of esta- 
blishments that have transactions with households in popu- 
lation sample surveys rarely serve as frames of establish- 
ment surveys that collect information about the transactions 
that establishments have with all households. The Consumer 
Price Index (CPI) produced by the Bureau of Labor 
Statistics is a notable and rare exception of a Federal 
establishment survey that depends on a population 
survey-generated sampling frame. The CPI Pricing Survey, a 
national retail establishment survey that collects prices for 
a basket of consumer goods purchased by all customers, 
uses as its sampling frame the listings of retail 
establishments that have transactions with households in the CPI 
Continuing Point of Purchase Survey (Leaver and Valliant 
1995). 

After reviewing plans of the National Center for Health 
Statistics (NCHS) to restructure its family of independent 
national surveys of health providers (hospitals, physicians, 
clinics, etc.), a Panel of the Committee on National 
Statistics proposed (Wunderlich 1992) using listings of 
health care providers reported by households in the 
National Health Interview Survey (NHIS), an ongoing 
national household sample survey (Massey, Moore, Parsons 
and Tadros 1991) as the sampling frames for national 
surveys of health care providers. The Committee thought 
that, especially in the current environment of rapid changes 
in listings of health care providers due to rapid changes in 
the nation’s health care delivery system, the NHIS-gener- 
ated health care provider frames would be more accurate 
and easier and less expensive to construct and maintain than 
the free-standing health care provider frames currently in 
use. Soon after the Panel report was issued, NCHS initiated 
a research project on population survey-generated sampling 
frames that is briefly summarized below. 

Initially, the research focused almost exclusively on the 
statistical properties of NHIS-generated frames of health 
care providers. Judkins, Berk, Edwards, Mohr, Stewart and 
Waksberg (1995) studied the quality of the free-standing 
health provider frames currently in use or of potential use, 
and discussed the kinds of medical providers for which 
NHIS-generated frames would seem to have the greatest 
potential. Subsequently, Judkins, Marker, Waksberg, 
Botman and Massey (1999) made rough comparisons of 
the efficiencies of dental surveys using the NHIS-generated 
sampling frame and using the free-standing frame, and 
concluded that NHIS-generated health care provider frames 
deserve serious consideration whenever reasonably 
complete free-standing health care provider frames with 
reasonably good size measures are unavailable. 

In recent years, the research has focused on the statistical 
properties of estimators that depend on population-gener- 
ated sampling frames and has become more theoretically 
focused than formerly. The conceptual difficulties initially 
encountered in developing unbiased estimators for the 
population survey-generated frame because the same estab- 
lishments have transactions with multiple households were 
overcome by applying network sampling theory (Sirken 
1997; Thompson 1992). Sirken, Shimizu and Judkins 
(1995) developed the network sampling version of the HH 
estimator, referred to in this paper as the NS estimator, and 
Sirken and Shimizu (1999) developed the network sampling 
version of the Horvitz-Thompson (HT) estimator. This 
paper develops a statistical error model that compares the 
efficiencies of the NS estimator that depends on the popu- 
lation survey-generated frame, and the HH estimator that 
depends on the free-standing frame. The error model 
assumes both frames are flawless in establishment coverage 
and size measures and have equivalent construction and 
maintenance costs. Though the model assumes a srs design 
for the population survey that generates the population 
survey-generated sampling frame, the model can be applied to 
other kinds of population survey designs that are not 
considered in this paper. 

This paper is organized as follows. Notation follows in 
section 2. Section 3.1 and section 3.2 respectively present 
the pps self-weighted HH estimator and variance of the 
two-stage establishment sample survey that depends on the 
free-standing sampling frame, and the NS estimator and 
variance of a two-stage establishment survey that depends 
on the population survey-generated frame. The error model 
is developed in sections 4.1- 4.4. The difference between 
two-stage HH and NS variances of equivalent expected 
sample sizes is developed in section 4.1. In section 4.2, the 
first stage variance component of the two-stage NS esti- 
mator is split into variance components representing effects 
of households with and without transactions, and section 
4.3 shows the design effects of the NS estimator in single 
stage sampling. Second stage variance components of the 
NS and HH estimators are compared in section 4.4. In the 
concluding section 5, the error model’s major findings 
comparing efficiencies of HH and NS estimators in single- 
stage and two-stage establishment surveys are briefly sum- 
marized, and limitations of the model are briefly discussed. 
The appendix presents the proof of a statistical statement 
appearing in section 4.2. 


2. NOTATION 


Let $N_j$ = the number of households having transactions 
with establishment $j$ $(j = 1, 2, \ldots, R)$, $N_0$ = the number of 
households not having transactions with any establishments, 
and $N^*$ = the number of distinct households having 
transactions with R establishments. Then, $N = N^* + N_0$ = the 
total number of households. 

Let $M_{ij}$ = the number of transactions of establishment 
$j$ $(j = 1, 2, \ldots, R)$ with household $i$ $(i = 1, 2, \ldots, N)$, where 
$M_{ij} > 0$ when establishment j has transactions with household i, 
and $M_{ij} = 0$ when establishment j and household i do 
not have transactions. Then, $M_j = \sum_{i=1}^{N} M_{ij}$ = the number of 
transactions of establishment j with N households, 
$M = \sum_{j=1}^{R} M_j$ = the number of transactions of R establishments 
with N households, and $\bar M = M/N$ = the average 
number of transactions per household. 

Let $X_{jk}$ denote the value of the x-variate for transaction 
$k$ $(k = 1, \ldots, M_j)$ of establishment $j$ $(j = 1, 2, \ldots, R)$. Then, 
$X_j = \sum_{k=1}^{M_j} X_{jk}$ = the sum of the x-variate over the $M_j$ 
transactions of establishment j, and $X = \sum_{j=1}^{R} X_j$ = the sum of 
the x-variate over the M transactions of R establishments. 
Let $\bar X_j = X_j/M_j$ = the average value of the x-variate over the 
$M_j$ transactions of establishment j, and $\bar X = X/M$ = the 
average value of the x-variate over M transactions. 


3. ESTIMATORS AND VARIANCES 


3.1 The HH Estimator and Variance 


Consider a two-stage self weighted establishment sample 
survey using a free-standing establishment sampling frame 
that lists all R establishments and their measures of size, 
$M_j$ $(j = 1, 2, \ldots, R)$. Establishments are the primary 
sampling units (psu’s), and transactions are the secondary 
sampling units. A sample of r establishments is selected 
with pps with replacement from the free-standing frame, 
and a sample of $t_{HH} < \min(M_1, \ldots, M_j, \ldots, M_R)$ transactions 
each, where $t_{HH}$ is a positive integer, is independently 
selected by simple random sampling without replacement 
for each sample establishment $j$ $(j = 1, 2, \ldots, r)$. 

The unbiased self-weighted pps HH estimator of X is 

$$\hat X_{HH} = \frac{M}{r} \sum_{j=1}^{r} \bar X_j' \tag{1}$$

where $\bar X_j' = \sum_{k=1}^{t_{HH}} X_{jk} / t_{HH}$ is the unbiased estimate of 
$\bar X_j = X_j/M_j$ $(j = 1, \ldots, r)$. Because establishments are 
selected with replacement, the HH estimator counts $\bar X_j'$ as 
many times as establishment j is selected in the sample. 
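The variance of $\hat X_{HH}$ is (Thompson 1992) 

$$\mathrm{Var}(\hat X_{HH}) = \frac{M^2}{r}\,\sigma^2_{HH1} + \frac{M}{r\,t_{HH}} \sum_{j=1}^{R} (M_j - t_{HH})\,\sigma_j^2 \tag{2}$$

where the first and second terms respectively on the right 
side of (2) are the first and second stage variance 
components, and 

$$\sigma^2_{HH1} = \frac{1}{M} \sum_{j=1}^{R} M_j \left(\bar X_j - X/M\right)^2 \tag{3}$$

is the between establishment population variance, and 

$$\sigma_j^2 = \frac{1}{M_j - 1} \sum_{k=1}^{M_j} \left(X_{jk} - X_j/M_j\right)^2 \tag{4}$$

is the within establishment population variance of 
establishment j. 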
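
As an informal check on the variance in (2), the sketch below compares the formula with the exact variance obtained by enumerating every outcome of a single pps draw (r = 1) followed by a without-replacement transaction sample. The toy population, the function names and the use of exact fractions are illustrative assumptions and are not taken from the paper; for r independent draws the variance is the single-draw variance divided by r.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy population: one list of transaction values per establishment.
ESTABLISHMENTS = [[1, 2, 3], [4, 6, 8], [2, 2, 5, 7]]


def hh_variance_formula(establishments, r, t):
    """Variance of the HH estimator from equation (2), in exact arithmetic."""
    M = sum(len(x) for x in establishments)
    X = sum(sum(x) for x in establishments)
    # Between-establishment variance, equation (3).
    between = sum(
        len(x) * (Fraction(sum(x), len(x)) - Fraction(X, M)) ** 2
        for x in establishments
    ) / M
    # Sum over establishments of (M_j - t) * sigma_j^2, sigma_j^2 as in (4).
    within = Fraction(0)
    for x in establishments:
        m_j = len(x)
        mean_j = Fraction(sum(x), m_j)
        sigma2_j = sum((v - mean_j) ** 2 for v in x) / (m_j - 1)
        within += (m_j - t) * sigma2_j
    return Fraction(M * M, r) * between + Fraction(M, r * t) * within


def hh_variance_one_draw(establishments, t):
    """Exact variance of the HH estimator for r = 1 by full enumeration."""
    M = sum(len(x) for x in establishments)
    X = sum(sum(x) for x in establishments)
    mean = Fraction(0)
    second_moment = Fraction(0)
    for x in establishments:
        p_j = Fraction(len(x), M)  # pps selection probability M_j / M
        # combinations() works on positions, so tied values stay distinct.
        samples = list(combinations(x, t))
        for s in samples:
            estimate = M * Fraction(sum(s), t)
            prob = p_j / len(samples)
            mean += prob * estimate
            second_moment += prob * estimate ** 2
    assert mean == X  # the estimator is unbiased
    return second_moment - mean ** 2


formula = hh_variance_formula(ESTABLISHMENTS, r=1, t=2)
exact = hh_variance_one_draw(ESTABLISHMENTS, t=2)
print(formula == exact)
```

Exact rational arithmetic is used so that the comparison is an equality test rather than a floating-point tolerance check.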


3.2 The NS Estimator and Variance 


Consider a two-stage establishment sample survey that 
depends on a population survey-generated frame. The frame 
lists n sample households $H_i$ $(i = 1, 2, \ldots, n)$ that were 
enumerated in a population sample survey. For each listed 
household $H_i$, the frame provides $\pi_i$, its selection 
probability in the household survey, and $M_{ij}$, the number of its 
transactions with each distinct establishment 
$j$ $(j = 1, 2, \ldots, R)$ (the $M_{ij}$'s are reported by household 
respondents in the population sample survey). 

Each of the n listed households in the population 
survey-generated frame represents a cluster of establishments 
ranging in size from 0 to R establishments with whom the 
household has transactions. The n clusters of establishments 
are the primary sampling units, and the $M_{ij}$ $(j \in A_i)$ 
transactions of the sampled establishments are secondary 
sampling units. The transaction sample for establishment j 
$(j = 1, \ldots, R)$ is selected as follows: a srs sample of 
$t_{NS} M_{ij}$ transactions, with $t_{NS} < \min(M_1, M_2, \ldots, M_R)$, is 
independently selected without replacement for each sample 
household $H_i$ $(i = 1, \ldots, n)$, where $t_{NS}$ is a positive integer. The 
transaction sample size of establishment j $(j = 1, \ldots, R)$ 
is equal to $t_{NS} \sum_{i=1}^{n} M_{ij}$, and the total transaction sample 
size is equal to $\tau\,t_{NS}$, where $\tau = \sum_{i=1}^{n} \sum_{j=1}^{R} M_{ij}$ = the sum of 
the transactions over n sample households is a random 
variable. 

The NS estimator of X is 

$$\hat X_{NS} = \sum_{i=1}^{n} \frac{1}{\pi_i} \sum_{j \in A_i} M_{ij}\,\bar X_j'(i)$$

where $A_i$ is the cluster of distinct establishments that have 
transactions with sample household $H_i$, and 

$$\bar X_j'(i) = \sum_{k=1}^{t_{NS} M_{ij}} X_{jk} \Big/ (t_{NS} M_{ij})$$

is an unbiased estimate of $\bar X_j$ for a sample of $t_{NS} M_{ij}$ 
transactions of establishment j. Because households are selected 
with replacement, the NS estimator counts the quantity 
$\sum_{j \in A_i} M_{ij}\,\bar X_j'(i)$ every time household $H_i$ $(i = 1, 2, \ldots, n)$ is 
selected in the sample, and because the same establishment 
has transactions with multiple households, the NS estimator 
counts the quantity $M_{ij}\,\bar X_j'(i)$ every time a sample 
household $i$ $(i = 1, 2, \ldots, n)$ contains establishment j. 
Assuming a srs design in the population survey, 
$\pi_i = n/N$, and the network sampling estimator is 

$$\hat X_{NS} = \frac{N}{n} \sum_{i=1}^{n} \sum_{j \in A_i} M_{ij}\,\bar X_j'(i). \tag{5}$$


The NS estimator is an unbiased estimator of X. 

$$E(\hat X_{NS}) = \sum_{i=1}^{N} \sum_{j \in A_i} M_{ij}\,E[\bar X_j'(i)] = \sum_{i=1}^{N} \sum_{j \in A_i} M_{ij}\,\bar X_j = \sum_{j=1}^{R} M_j \bar X_j = X.$$

The NS estimator in (5) is self-weighted because we have 
assumed that the n households are selected by srs. It would 
be a self-weighted estimator whenever the sample design of 
the population sample survey that generates the 
establishment sampling frame is self-weighted. When 
$N = N^* = M$, implying that $N^*$ households each has a 
single transaction, and $N_0 = N - N^*$ households are without 
transactions, and when $n = r$ and $t_{NS} = t_{HH}$, the HH and NS 
estimators are equivalent. 


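$$\hat X_{NS} = \frac{N}{n} \sum_{i=1}^{n} \sum_{j \in A_i} M_{ij}\,\bar X_j'(i) = \frac{M}{r} \sum_{i=1}^{r} \sum_{j \in A_i} \bar X_j'(i) = \frac{M}{r} \sum_{j=1}^{r} \bar X_j' = \hat X_{HH}. \tag{6}$$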
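
The unbiasedness of the NS estimator in (5) can be checked by brute force on a small example. The sketch below is a single-stage simplification: it assumes the establishment means are known, so $\bar X_j'(i)$ is replaced by $\bar X_j$, and it enumerates all equally likely with-replacement samples of n = 2 households. The population, the names and the sample size are hypothetical choices made for illustration only.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population: rows are households, columns are establishments,
# entries are the transaction counts M_ij (household 3 has no transactions).
M_IJ = [[2, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 0], [0, 0, 3]]
# Assumed establishment means of the x-variate (X-bar_j).
X_BAR = [Fraction(2), Fraction(5), Fraction(3)]


def household_totals(m_ij, x_bar):
    """y_i = sum over j of M_ij * X-bar_j for every household."""
    return [sum(m * xb for m, xb in zip(row, x_bar)) for row in m_ij]


def ns_estimate(sample, m_ij, x_bar):
    """Equation (5) with X-bar_j'(i) replaced by X-bar_j (single stage)."""
    y = household_totals(m_ij, x_bar)
    return Fraction(len(m_ij), len(sample)) * sum(y[i] for i in sample)


def enumerate_ns(m_ij, x_bar, n):
    """Mean and variance over all with-replacement samples of n households."""
    samples = list(product(range(len(m_ij)), repeat=n))
    values = [ns_estimate(s, m_ij, x_bar) for s in samples]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


N = len(M_IJ)
X = sum(sum(row[j] for row in M_IJ) * X_BAR[j] for j in range(len(X_BAR)))
mean, variance = enumerate_ns(M_IJ, X_BAR, n=2)

# Between-household population variance of the household totals.
y = household_totals(M_IJ, X_BAR)
sigma2_ns1 = sum((v - X / N) ** 2 for v in y) / N

print(mean == X, variance == Fraction(N * N, 2) * sigma2_ns1)
```

In this toy case the enumerated mean equals the population total, and the enumerated variance equals $N^2/n$ times the between-household variance of the household totals.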


The variance of the NS estimator (5), under srs sampling 
with replacement of n households and independent 
selections of $t_{NS} M_{ij}$ transactions by srs without replacement 
for each establishment j linked to household $H_i$, is (Sirken 
et al. 1995) 

$$\mathrm{Var}(\hat X_{NS}) = \frac{N^2}{n}\,\sigma^2_{NS1} + \frac{N}{n\,t_{NS}} \sum_{i=1}^{N} \sum_{j=1}^{R} M_{ij}\,\frac{M_j - t_{NS} M_{ij}}{M_j}\,\sigma_j^2 \tag{7}$$

where the first and second terms respectively on the right 
side of (7) are the first and second stage variance 
components, and 

$$\sigma^2_{NS1} = \frac{1}{N} \sum_{i=1}^{N} \Big( \sum_{j \in A_i} M_{ij}\,\bar X_j - X/N \Big)^2 \tag{8}$$

is the population variance between households, and $\sigma_j^2$ the 
population variance within establishment j as defined in (4). 
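An unbiased estimate of NS variance is 

$$\mathrm{var}(\hat X_{NS}) = \frac{N^2}{n(n-1)} \sum_{i=1}^{n} \Big( \sum_{j \in A_i} M_{ij}\,\bar X_j'(i) - \bar X' \Big)^2 \tag{9}$$

where $\bar X' = \hat X_{NS}/N$. 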


4. THE ERROR MODEL 


4.1 HH and NS Variances of Equivalent Expected 
Sample Size 


Subtracting (2) from (7), the difference between the 
variances of the HH and NS estimators of X is 

$$\mathrm{Var}(\hat X_{NS}) - \mathrm{Var}(\hat X_{HH}) = \left[ \frac{N^2}{n}\,\sigma^2_{NS1} - \frac{M^2}{r}\,\sigma^2_{HH1} \right] + \left[ \frac{N}{n\,t_{NS}} \sum_{i=1}^{N} \sum_{j=1}^{R} M_{ij}\,\frac{M_j - t_{NS} M_{ij}}{M_j}\,\sigma_j^2 - \frac{M}{r\,t_{HH}} \sum_{j=1}^{R} (M_j - t_{HH})\,\sigma_j^2 \right] \tag{10}$$

where the first and second set of bracketed terms 
respectively on the right side of (10) represent the differences 
between the primary and secondary variance components of 
the HH and NS estimators of X. 

Let $m_{NS} = \tau\,t_{NS}$ = the size of the transaction sample in the 
establishment survey using the population survey-generated 
frame, where $t_{NS}$, a positive integer, is the size of the 
transaction sample selected per transaction of the n sample 
households, and $\tau = \sum_{i=1}^{n} \sum_{j \in A_i} M_{ij}$ = the sum of the transactions 
of n sample households. 

Clearly, $\tau$ is a random variable and its expected value 
conditional over all samples of n households is $E(\tau \mid n) = n\bar M$, 
where $\bar M = M/N$ = the average household transaction size. 
It follows that $E(m_{NS} \mid n) = t_{NS}\,E(\tau \mid n) = n\bar M t_{NS}$ = the 
expected transaction sample size of the NS estimator 
conditional over all samples of n households. 

Let $m_{HH} = r\,t_{HH}$ = the size of the transaction sample in 
the establishment survey using the stand-alone frame, where 
r = the establishment sample size, and $t_{HH}$ = the transaction 
sample size per selected establishment. Let $r = E(\tau \mid n) = n\bar M$ 
and let $t_{HH} = t_{NS} = t$, and it follows that the expected 
transaction sample sizes of the NS and HH estimators 
conditional over all samples of n households are equivalent, 
namely, $E(m_{HH} \mid n) = t\,E(\tau \mid n) = nt\bar M = E(m_{NS} \mid n)$. 

Calibrating the establishment and transaction sample 
sizes in this manner assures that HH and the NS establish- 
ment surveys are conducted under roughly the same fiscal 
constraints if per establishment and per transaction field 
costs are about the same in both surveys. It is noteworthy, 
however, that this cost equation does not take into account 
the differences in costs between constructing and main- 
taining stand-alone establishment frames and population 
survey-generated establishment frames. 
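
The calibration of sample sizes described above can be illustrated with a few lines of arithmetic. In the sketch below the population matrix, the function names and the values n = 10 and t = 2 are hypothetical; the expected number of transactions linked to n sampled households is also confirmed by enumeration for n = 2. In practice r = n times the average number of transactions per household need not be an integer and would have to be rounded, which the sketch does not address.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population: rows are households, columns are establishments,
# entries are the transaction counts M_ij.
M_IJ = [[2, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 0], [0, 0, 3]]


def calibrate(m_ij, n, t):
    """Return (r, expected NS transaction sample size, HH transaction sample size)."""
    N = len(m_ij)
    M = sum(sum(row) for row in m_ij)
    m_bar = Fraction(M, N)         # average number of transactions per household
    r = n * m_bar                  # establishment sample size for the HH design
    expected_m_ns = t * n * m_bar  # expected NS transaction sample size
    m_hh = r * t                   # HH transaction sample size
    return r, expected_m_ns, m_hh


def expected_transactions(m_ij, n):
    """Expected number of transactions linked to n households drawn by srs
    with replacement, computed by enumerating all equally likely samples."""
    row_sums = [sum(row) for row in m_ij]
    samples = list(product(range(len(m_ij)), repeat=n))
    total = sum(sum(row_sums[i] for i in s) for s in samples)
    return Fraction(total, len(samples))


r, expected_m_ns, m_hh = calibrate(M_IJ, n=10, t=2)
print(r, expected_m_ns, m_hh)
```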

Substituting $r = n\bar M$, $t_{HH} = t_{NS} = t$, and $M = N\bar M$ in 
formula (10), the difference between the NS and HH 
variances of equivalent expected establishment and 
transaction sample size conditional over all samples of n 
households is 

$$\mathrm{Var}(\hat X_{NS}) - \mathrm{Var}(\hat X_{HH}) = \frac{N^2}{n} \left[ \sigma^2_{NS1} - \bar M \sigma^2_{HH1} \right] + \frac{N}{nt} \sum_{j=1}^{R} \left[ \sum_{i=1}^{N} \frac{M_{ij}(M_j - t M_{ij})}{M_j} - (M_j - t) \right] \sigma_j^2. \tag{11}$$

The first and second terms respectively on the right 
side of (11) represent the difference between the first stage 
and second stage variance components of the NS and HH 
estimators of equivalent expected sample sizes conditional 
over all samples of n households. 


4.2 Decomposition of the Single Stage NS 
Population Variance 


Typically, some households do not have transactions 
with any establishments, and the percentage varies by type 
of establishment. For example, medical care utilization by 
families in the United States varies greatly by type of health 
care provider (Dicker and Sunshine 1987). During a 12 
month period, 70 percent of families were not admitted to 
hospitals, 7 percent did not have ambulatory physician 
visits, and 28 percent did not have dental visits. 

Let 

$P = N^*/N$ = fraction of N households with one 
or more transactions, and 

$1 - P = N_0/N$ = fraction of N households without 
any transactions. 

We demonstrate in the Appendix that the single stage 
population variance of the NS estimator of X, when 
expressed as a function of P, decomposes into 2 parts 

$$\sigma^2_{NS1}(P) = \frac{1}{N} \sum_{i=1}^{N} \Big( \sum_{j \in A_i} M_{ij}\,\bar X_j - X/N \Big)^2 = P\,\sigma^2_{NS1^*} + \sigma^2(P)\,E^2_{NS1^*}, \quad 0 < P \le 1 \tag{12}$$

where 

$$\sigma^2_{NS1^*} = \frac{1}{N^*} \sum_{i=1}^{N^*} \Big( \sum_{j \in A_i} M_{ij}\,\bar X_j - X/N^* \Big)^2 \tag{13}$$

is the single stage population variance of the x-variate over 
the truncated population of $N^*$ households with one or 
more transactions, 

$$E^2_{NS1^*} = \Big[ \frac{1}{N^*} \sum_{i=1}^{N^*} \sum_{j \in A_i} M_{ij}\,\bar X_j \Big]^2 = (X/N^*)^2 \tag{14}$$

is the expected value squared of the x-variate over the 
truncated population of $N^*$ households and 

$$\sigma^2(P) = P(1-P) \tag{15}$$


is the variance of the binomial variable P. For fixed M, the 
function $\sigma^2_{NS1}(P \mid M)$ is maximum when 


eee te 5 |esi! Eee sn 
2 2 4 2 2 
VG aks Sanpbae Dae oa ~ srlatiiand2 v7 Gwe Oneal G 


2 Pe est: 
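When $P = 1$, $\sigma^2(P = 1) = 0$ and therefore $\sigma^2_{NS1}(P = 1) = \sigma^2_{NS1^*}$. 
If $P = \bar M = (M/N) = 1$, implying that each of N 
households has a single transaction, 

$$\sigma^2_{NS1}(P = \bar M = 1) = \sigma^2_{NS1^*}(N^* = M) = \sigma^2_{HH1} \tag{16}$$

because 

$$\sigma^2_{NS1^*}(N^* = M) = \frac{1}{N^*} \sum_{i=1}^{N^*} \Big( \sum_{j \in A_i} M_{ij}\,\bar X_j - X/N^* \Big)^2 = \frac{1}{M} \sum_{j=1}^{R} M_j \left( \bar X_j - X/M \right)^2 = \sigma^2_{HH1}, \tag{17}$$

and $\sigma^2(P = 1) = 0$. In other words when $P = \bar M = 1$, 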
implying each of the N households has a single transaction, 
the variance of the NS1 estimator which would then depend 
on a srs of transactions with replacement is equivalent to the 
variance of the HH1 estimator that depends on a pps cluster 
sample of equivalent sample size selected with replacement. 
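
The special case in (16) and (17) is easy to confirm numerically. The sketch below builds a hypothetical population in which every household has exactly one transaction, so that N = N* = M, and compares the between-household variance in (8) with the between-establishment variance in (3). The establishment sizes and means are made-up values.

```python
from fractions import Fraction

# Hypothetical establishment sizes M_j and means X-bar_j.
M_J = [3, 2, 4]
X_BAR = [Fraction(2), Fraction(5), Fraction(3)]


def sigma2_hh1(m_j, x_bar):
    """Between-establishment population variance, equation (3)."""
    M = sum(m_j)
    X = sum(m * xb for m, xb in zip(m_j, x_bar))
    return sum(m * (xb - X / M) ** 2 for m, xb in zip(m_j, x_bar)) / M


def sigma2_ns1(m_ij, x_bar):
    """Between-household population variance, equation (8)."""
    N = len(m_ij)
    y = [sum(m * xb for m, xb in zip(row, x_bar)) for row in m_ij]
    X = sum(y)
    return sum((v - X / N) ** 2 for v in y) / N


def one_transaction_households(m_j):
    """One household per transaction: each row has a single 1 in column j."""
    rows = []
    for j, count in enumerate(m_j):
        for _ in range(count):
            rows.append([1 if col == j else 0 for col in range(len(m_j))])
    return rows


M_IJ = one_transaction_households(M_J)
ns1 = sigma2_ns1(M_IJ, X_BAR)
hh1 = sigma2_hh1(M_J, X_BAR)
print(ns1 == hh1)
```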


4.3 Design Effects in Single Stage Sampling 


Let $\hat X_{HH1} = \frac{M}{r} \sum_{j=1}^{r} \bar X_j$ = the unbiased HH estimator of X 
in single stage sampling. 

Define the single stage sampling total design effect of 
the NS1 estimator as the ratio of the variances of the NS1 
and HH1 estimators of equivalent sample size conditional 
over all samples of n households. 

$$\Delta(P) = \frac{\mathrm{Var}(\hat X_{NS1})}{\mathrm{Var}(\hat X_{HH1})} = \frac{\sigma^2_{NS1}(P)}{\bar M\,\sigma^2_{HH1}} \tag{18}$$

where $\Delta(P) < 1$ indicates that the NS1 estimator is more 
efficient than the HH1 estimator, and $\Delta(P) > 1$ indicates 
that the HH1 estimator is more efficient than the NS1 
estimator. 

We noted in (12) and (15) that $\sigma^2_{NS1}(P) = P\,\sigma^2_{NS1^*} + 
P(1-P)(X/N^*)^2$, and in (16) that $\sigma^2_{HH1} = \sigma^2_{NS1^*}(N^* = M)$. 
Making these substitutions in (18), the total design effect 
becomes 

$$\Delta(P) = \mathrm{deft}^2_{NS1^*} + (1-P)\,Z_{NS1^*}, \quad 0 < P \le 1 \tag{19}$$

where 

$$(1-P)\,Z_{NS1^*} = (1-P)\,\frac{P\,E^2_{NS1^*}}{\bar M\,\sigma^2_{NS1^*}(N^* = M)} \tag{20}$$

is the effect due to the $N_0$ households without transactions, 
and 

$$\mathrm{deft}^2_{NS1^*} = \frac{P\,\sigma^2_{NS1^*}}{\bar M\,\sigma^2_{HH1}} = \frac{P\,\sigma^2_{NS1^*}}{\bar M\,\sigma^2_{NS1^*}(N^* = M)} \tag{21}$$

is the effect due to the $N^*$ households with transactions. In 
other words, $\mathrm{deft}^2_{NS1^*}$ is the design effect of network 
sampling a population of $N^*$ household clusters containing 
one or more transactions, with equal probability and 
replacement, compared to network sampling a population 
of M transactions, of equivalent expected sample size, by 
srs and replacement. [The reader is referred to Kish (1982) 
for the definition of $\mathrm{deft}^2$]. 

The total design effect in (19) depends on $\mathrm{deft}^2_{NS1^*}$, 
$Z_{NS1^*}$ and P, and the values of these parameters, as well as 
relationships between them, are likely to vary considerably 
between surveys, and between variables and population 
domains in the same surveys. Though, in theory, the NS1 
estimator could be more efficient than the HH1 estimator, in 
reality that outcome seems highly unlikely because cluster 
sampling is typically less efficient than srs. A necessary 
condition for the NS1 estimator to be as efficient or more 
efficient than the HH1 estimator is that $\mathrm{deft}^2_{NS1^*} \le 1 - 
(1-P)\,Z_{NS1^*}$, and this condition is unlikely to be met 
particularly if P is small, and if the within household 
transaction clustering is mostly due to households having 
multiple transactions with the same establishments rather 
than households having transactions with multiple 
establishments. 
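
To make the components of the total design effect concrete, the sketch below evaluates (18) and the two terms of (19) for a small hypothetical population and confirms that they add up. The population, the establishment means and the function name are illustrative assumptions; the formulas follow (3), (8), (12) and (21), with Z taken as P times the squared truncated mean divided by the average transactions per household times the between-establishment variance.

```python
from fractions import Fraction

# Hypothetical population: rows are households, columns are establishments,
# entries are the transaction counts M_ij (household 3 has no transactions).
M_IJ = [[2, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 0], [0, 0, 3]]
# Assumed establishment means of the x-variate (X-bar_j).
X_BAR = [Fraction(2), Fraction(5), Fraction(3)]


def design_effect_components(m_ij, x_bar):
    """Return (P, total design effect, deft^2, Z) for single stage sampling."""
    N = len(m_ij)
    m_j = [sum(row[j] for row in m_ij) for j in range(len(x_bar))]
    M = sum(m_j)
    m_bar = Fraction(M, N)
    y = [sum(m * xb for m, xb in zip(row, x_bar)) for row in m_ij]
    X = sum(y)
    # Truncated population: households with at least one transaction.
    y_star = [v for v, row in zip(y, m_ij) if sum(row) > 0]
    n_star = len(y_star)
    P = Fraction(n_star, N)
    sigma2_ns1 = sum((v - X / N) ** 2 for v in y) / N
    sigma2_star = sum((v - X / n_star) ** 2 for v in y_star) / n_star
    e2_star = (X / n_star) ** 2
    sigma2_hh1 = sum(m * (xb - X / M) ** 2 for m, xb in zip(m_j, x_bar)) / M
    total = sigma2_ns1 / (m_bar * sigma2_hh1)       # equation (18)
    deft2 = P * sigma2_star / (m_bar * sigma2_hh1)  # equation (21)
    z = P * e2_star / (m_bar * sigma2_hh1)          # Z in equation (19)
    return P, total, deft2, z


P, total, deft2, z = design_effect_components(M_IJ, X_BAR)
print(total == deft2 + (1 - P) * z)
```

In this made-up population the term due to households without transactions dominates the total, which is in line with the remark that the condition for NS1 to match HH1 is hard to meet.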


4.4 Comparing Efficiencies in Two-stage Sampling 


In two stage sampling, the difference between the HH 
and NS second stage variance components for equivalent 
expected sample size of $nt\bar M$ transactions conditional over 
all samples of n households, the second term on the right 
side of equation (11), reduces to 

$$\frac{N}{nt} \sum_{j=1}^{R} \left[ \sum_{i=1}^{N} \frac{M_{ij}(M_j - t M_{ij})}{M_j} - (M_j - t) \right] \sigma_j^2 = -\frac{N}{n} \sum_{j=1}^{R} \frac{\rho_j}{M_j}\,\sigma_j^2 \tag{22}$$

where $\rho_j/M_j = (1/M_j) \sum_{i=1}^{N} M_{ij}(M_{ij} - 1)$ is the difference 
between the HH and NS second stage finite population 
corrections for establishment j. If none of the N households 
have multiple transactions with establishment j, the HH and 
NS second stage variances of establishment j are equivalent 
and $\rho_j = 0$. Otherwise, $\rho_j > 0$ and the second stage variance for 
establishment j is larger for the HH than the NS estimator. 
The value of $\rho_j$ is maximum when establishment j has $M_j$ 
transactions with a single household. 
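
The reduction of the second stage difference to a sum involving $\rho_j$ can be verified with exact arithmetic. The sketch below computes the NS second stage component minus the HH second stage component directly, with r set to n times the average number of transactions per household, and compares it with the reduced form. The population, the within-establishment variances and the values n = 10 and t = 1 are hypothetical.

```python
from fractions import Fraction

# Hypothetical population: rows are households, columns are establishments,
# entries are the transaction counts M_ij.
M_IJ = [[2, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 0], [0, 0, 3]]
# Assumed within-establishment variances sigma_j^2.
SIGMA2_J = [Fraction(1), Fraction(4), Fraction(2)]


def column_totals(m_ij):
    """M_j = total number of transactions of establishment j."""
    return [sum(row[j] for row in m_ij) for j in range(len(m_ij[0]))]


def rho(m_ij):
    """rho_j = sum over households of M_ij * (M_ij - 1)."""
    return [sum(row[j] * (row[j] - 1) for row in m_ij)
            for j in range(len(m_ij[0]))]


def second_stage_difference(m_ij, sigma2_j, n, t):
    """NS minus HH second stage variance components, with r = n * M-bar."""
    N = len(m_ij)
    m_j = column_totals(m_ij)
    M = sum(m_j)
    r = Fraction(n * M, N)
    ns = Fraction(N, n * t) * sum(
        Fraction(row[j] * (m_j[j] - t * row[j]), m_j[j]) * sigma2_j[j]
        for row in m_ij
        for j in range(len(m_j))
    )
    hh = Fraction(M) / (r * t) * sum(
        (m_j[j] - t) * sigma2_j[j] for j in range(len(m_j))
    )
    return ns - hh


def reduced_form(m_ij, sigma2_j, n):
    """Minus (N / n) times the sum over j of (rho_j / M_j) * sigma_j^2."""
    N = len(m_ij)
    m_j = column_totals(m_ij)
    return -Fraction(N, n) * sum(
        Fraction(p, m) * s for p, m, s in zip(rho(m_ij), m_j, sigma2_j)
    )


direct = second_stage_difference(M_IJ, SIGMA2_J, n=10, t=1)
reduced = reduced_form(M_IJ, SIGMA2_J, n=10)
print(direct == reduced)
```

The difference is negative here because two households have multiple transactions with the same establishment, so the HH second stage component is the larger one.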

The second stage variance components of the HH and 
NS estimators are equivalent when $\sum_{j=1}^{R} \rho_j = 0$, that is, when 
none of the N households have multiple transactions with 
any of the R establishments. Of course, second stage 
variances are also equivalent if transactions are selected with 
replacement or the within establishment variances 
$\sigma_j^2 = 0$ $(j = 1, 2, \ldots, R)$. Except for these contingencies, 
however, the second stage variance is always larger for the 
HH estimator than for the NS estimator, and the magnitude 
of the difference depends on the extent of within household 
clustering of transactions with the same establishments, and 
the magnitudes of the within establishment variances. 

If none of the $N^*$ households have multiple transactions 
with the same establishments, the difference between the 
variances of the HH and NS estimators is the same in 
single stage and two stage establishment sample surveys. 
Otherwise, the difference between HH and NS variances is 
less in two stage than in single stage establishment sample 
surveys because whenever households have multiple trans- 
actions with the same establishments the second stage 
variance is greater for the HH estimator than for the NS 
estimator. 


5. SUMMARY AND CONCLUDING REMARKS 


The error model presented in this paper compares 
efficiencies of two estimators of the volume of transactions 
between establishments and populations in single-stage and 
two-stage establishment sample surveys. The Hansen- 
Hurwitz (HH) estimator depends on a stand-alone sampling 
frame that lists every establishment and the volume of its 
transactions with all households during a specified calendar 
period. The network sampling (NS) estimator depends on 
a population survey-generated frame that lists the house- 
holds and their selection probabilities in a population 
sample survey, and for each household, lists the number of 


its transactions with each distinct establishment during the 
specified calendar period. 

Also, the NS and HH estimators depend on different 
establishment survey sample designs. In single-stage 
sampling, the HH estimator depends on a design in which 
establishments are the selection units and they are selected 
with pps with replacement, and the NS estimator depends 
on a design in which households are the selection units and 
they are selected with their selection probabilities in the 
population survey, which the error model assumes is srs 
with replacement. In two-stage sampling, transactions are 
the second stage sampling units of the HH and NS esti- 
mators. The HH estimator depends on fixed-size transaction 
samples that are selected by srs independently without 
replacement. The NS estimator depends on transaction 
sample sizes that are proportional to the number of trans- 
actions of each household with each establishment, and are 
selected independently by srs without replacement. 

The NS and HH estimators are equally efficient, if and 
only if, every household in the entire population has one 
and only one transaction. Otherwise, neither the NS nor the 
HH estimator is necessarily more efficient than the other. 
Nevertheless, it seems likely that the HH estimator will be 
more efficient than the NS estimator in single-stage esta- 
blishment survey sampling, and perhaps substantially more 
efficient especially when large fractions of households do 
not have any transactions, and/or when the within house- 
hold clustering of transactions among households with 
transactions is principally due to households having 
multiple transactions with the same establishments rather 
than households having transactions with multiple esta- 
blishments. In two-stage sampling, the outcome is not as 
transparent as in single stage sampling because the second 
stage variance component is larger for the HH estimator 
than the NS estimator by an amount that depends on the 
extensiveness of within household clustering of transactions 
with the same establishments. 

Arguably the foremost limitation of the error model 
presented in this paper is the presumption that the stand- 
alone and population survey-generated sampling frames are 
flawless in coverage and size measures. However, compa- 
rative costs of constructing and maintaining good quality 
stand-alone and population-generated establishment 
sampling frames are likely to vary greatly from survey to 
survey. Though the model seeks to equalize the 
establishment survey costs based on each kind of sampling frame, it 
ignores the differential costs of constructing and 
maintaining each kind of frame. 


Even in the absence of empirical data about the compa- 
rative costs of constructing and maintaining the frames, it 
is fair to say that the population survey-generated frame 
should be seriously considered as a potential design alter- 
native whenever constructing and maintaining good quality 
stand-alone frames would be infeasible or exorbitantly 
expensive or time consuming, and/or when constructing and 
maintaining good quality population survey-generated 
establishment sampling frames would be relatively 
inexpensive. For example, the population survey-generated frame 
would be particularly attractive as a potential design 
alternative to the stand-alone frame when the stand-alone 
frame would be difficult to construct and maintain because 
it was undergoing rapid change due to births, deaths, and 
establishment mergers, and the population survey-generated 
frame costs would be relatively small because 
it could be constructed and maintained as a by-product of 
an ongoing population sample survey (Wunderlich 1992) 
and/or as a by-product of an ongoing program of matching 
transactions of households enumerated in a population 
survey with their establishment records (Cohen 1998). 

Another limitation of the model is the unrealistic 
assumption that the population survey that generates the 
establishment sampling frame is based on a single stage 
sample design in which households are selected with equal 
probabilities and with replacement. In fact, population 
surveys are virtually always based on multistage sample 
designs in which households are selected without replace- 
ment in the final sampling stage. Typically, the srs 
assumption tends to significantly understate the variance of 
the NS estimator, and therefore would have the effect of 
exaggerating the relative efficiency of the NS estimator 
compared to the HH estimator. On the other hand, the 
household sampling with replacement assumption would 
have the opposite effects, but would be modest (Sirken 
2001) compared to the srs assumption. The error model can 
be applied, however, to the other population survey sample 
designs that are not considered in this paper. 

The error model presented in this paper identifies the 
critical parameters that determine the relative efficiency of 
establishment survey estimators depending on stand-alone 
and population survey-generated sampling frames. Values 
of these parameters will vary greatly between surveys and 
between variables and population domains in the same 
surveys. Unfortunately, empirical data are currently 
unavailable, and they are sorely needed to estimate the 
model’s parameters under a broad range of survey condi- 
tions. Hopefully, this paper will stimulate interest in 
conducting establishment surveys that depend on popu- 
lation survey-generated establishment sampling frames, and 
will lead to improvements in designing establishment 
surveys that estimate the volume of transactions between 
establishments and populations. 
APPENDIX 


When expressed as a function of P, the fraction of 
households with one or more transactions, the single stage 
population variance of the network sampling (NS) estimator of 
X decomposes into 2 parts 

$$\sigma^2_{NS1}(P) = P\,\sigma^2_{NS1^*} + \sigma^2(P)\,E^2_{NS1^*}, \quad 0 < P \le 1$$

where 

$$\sigma^2_{NS1^*} = \frac{1}{N^*} \sum_{i=1}^{N^*} \Big( \sum_{j \in A_i} M_{ij}\,\bar X_j - X/N^* \Big)^2$$

is the truncated single stage population variance of the NS 
estimator exclusive of the $N_0 = N - N^*$ households without 
transactions with establishments, 

$$\sigma^2(P) = P(1-P)$$

is the variance of the binomial variable P, and 

$$E^2_{NS1^*} = (X/N^*)^2$$

is the expected value squared of the x-variate distributed 
over $N^*$ households. 

Proof 


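$$\sigma^2_{NS1} = \frac{1}{N} \sum_{i=1}^{N} \Big( \sum_{j \in A_i} M_{ij}\,\bar X_j - X/N \Big)^2 = \frac{N^*}{N} \cdot \frac{1}{N^*} \sum_{i=1}^{N^*} \Big( \sum_{j \in A_i} M_{ij}\,\bar X_j \Big)^2 - \Big( \frac{N^*}{N} \Big)^2 \Big( \frac{X}{N^*} \Big)^2 \tag{A.1}$$

because the $N_0$ households without transactions contribute 
nothing to the sum of squares. By the definitions of 
$\sigma^2_{NS1^*}$ and $E^2_{NS1^*}$, 

$$\frac{1}{N^*} \sum_{i=1}^{N^*} \Big( \sum_{j \in A_i} M_{ij}\,\bar X_j \Big)^2 = \sigma^2_{NS1^*} + E^2_{NS1^*}. \tag{A.2}$$

Substitute (A.2) for the first term on the right side of (A.1). 

$$\sigma^2_{NS1}(P) = P\,\sigma^2_{NS1^*} + P\,E^2_{NS1^*}(1 - P) = P\,\sigma^2_{NS1^*} + \sigma^2(P)\,E^2_{NS1^*} \tag{A.3}$$

where $\sigma^2(P) = P(1-P)$. 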
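
The decomposition proved above can also be checked numerically. The sketch below evaluates both sides for a small hypothetical population in which one of five households has no transactions; the population, the establishment means and the function name are assumptions made for illustration.

```python
from fractions import Fraction

# Hypothetical population: rows are households, columns are establishments,
# entries are the transaction counts M_ij (household 3 has no transactions).
M_IJ = [[2, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 0], [0, 0, 3]]
# Assumed establishment means of the x-variate (X-bar_j).
X_BAR = [Fraction(2), Fraction(5), Fraction(3)]


def decomposition(m_ij, x_bar):
    """Return (left side, right side) of the variance decomposition."""
    N = len(m_ij)
    y = [sum(m * xb for m, xb in zip(row, x_bar)) for row in m_ij]
    X = sum(y)
    # Truncated population: households with at least one transaction.
    y_star = [v for v, row in zip(y, m_ij) if sum(row) > 0]
    n_star = len(y_star)
    P = Fraction(n_star, N)
    lhs = sum((v - X / N) ** 2 for v in y) / N
    sigma2_star = sum((v - X / n_star) ** 2 for v in y_star) / n_star
    e2_star = (X / n_star) ** 2
    rhs = P * sigma2_star + P * (1 - P) * e2_star
    return lhs, rhs


lhs, rhs = decomposition(M_IJ, X_BAR)
print(lhs == rhs)
```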
A Generalization of the Lavallée and Hidiroglou Algorithm for 
Stratification in Business Surveys 


LOUIS-PAUL RIVEST' 


ABSTRACT 


This paper suggests stratification algorithms that account for a discrepancy between the stratification variable and the study 
variable when planning a stratified survey design. Two models are proposed for the change between these two variables. 
One is a log-linear regression model; the other postulates that the study variable and the stratification variable coincide for 
most units, and that large discrepancies occur for some units. Then, the Lavallée and Hidiroglou (1988) stratification 
algorithm is modified to incorporate these models in the determination of the optimal sample sizes and of the optimal 
stratum boundaries for a stratified sampling design. An example illustrates the performance of the new stratification 
algorithm. A discussion of the numerical implementation of this algorithm is also presented. 


KEY WORDS: Neyman allocation; Power allocation; Stratified random sampling. 


1. INTRODUCTION 


The construction of stratified sampling designs has a 
long history in the statistical sciences. Chapters 5 and 5A in 
Cochran (1977) review several techniques for splitting a 
population into strata. The construction of strata is a topic 
of current interest in the statistical literature. Recent 
contributions include Hedlin (2000), who revisits Ekman's (1959) 
rule for stratification, and Dorfman and Valliant (2000), who 
compare model-based stratification with balanced sampling. 
Model-based stratification is discussed in Godfrey, 
Roshwalb, and Wright (1984) and in chapter 12 of Särndal, 
Swensson, and Wretman (1992). 

In business surveys, populations have skewed distri- 
butions; a small number of units accounts for a large share 
of the total of the study variable. It is therefore appropriate 
to include all large units in the sample (Dalenius 1952; 
Glasser 1962). A good sampling design has one take-all 
stratum for big firms, where the units are all sampled, 
together with take-some strata for businesses of medium 
and small sizes. Typically the sampling fraction goes down 
with the size of the unit; small businesses get large 
sampling weights. The Lavallée and Hidiroglou (1988) 
stratification algorithm is often used to determine the 
stratum boundaries and the stratum sample sizes in this 
context (see for instance Slanta and Krenzke 1994, 1996). 
This algorithm uses a stratification variable, known for all 
the units of the population. It gives the stratum boundaries 
and the stratum sample sizes that minimize the total sample 
size required to achieve a target level of precision. It uses an 
iterative procedure, due to Sethi (1963), to determine the 
optimal stratum boundaries. The Lavallée and Hidiroglou 
algorithm does not account for a difference between the 
stratification and the survey variables. As time goes by, this 
difference increases and the sampling design provided by 
the Lavallée and Hidiroglou algorithm may fail to meet the 
precision criterion. 

Stratification in situations where the survey variable and 
the stratification variable differ is considered in Dalenius 
and Gurney (1951); see also Cochran (1977, chapter 5A). 
Many authors have studied approximate formulae for 
determining stratum boundaries, and for evaluating the gain 
in precision resulting from stratification on an auxiliary 
variable. Some relevant contributions are Serfling (1968), 
Singh and Sukhatme (1969), Singh (1971), Singh and 
Parkash (1975), Anderson, Kish and Cornell (1976), Oslo 
(1976), Wang and Aggarwal (1984) and Yadava and Singh 
(1984). Hidiroglou and Srinath (1993) and Hidiroglou 
(1994) suggest techniques to update stratum boundaries 
using a new stratification variable. However, these papers 
do not explicitly provide stratification algorithms 
accounting for the discrepancy between the stratification 
variable and the survey variable. This paper fills this gap by 
constructing generalizations of the Lavallée and Hidiroglou 
(1988) algorithm that express the difference between these 
two variables in terms of a statistical model. 

A brief review of stratified sampling and of sample 
allocation methods is first given. Models for the difference 
between stratification and survey variables are then pro- 
posed. The implementation of Sethi’s algorithm, when the 
stratification and the survey variable differ, is then 
presented. Numerical illustrations are provided. 


2. A REVIEW OF STRATIFIED RANDOM 
SAMPLING 


Some of the standard notation of stratified random 
sampling that will be used in this paper is 

$L$ = the number of strata; 

$W_h = N_h/N$ is, for $h = 1, \ldots, L$, the relative weight of 
stratum $h$, $N_h$ is the size of stratum $h$, and $N = \sum_h N_h$ is 
the total population size; 

$n_h$ is, for $h = 1, \ldots, L$, the sample size in stratum $h$ and 
$f_h = n_h/N_h$ is the sampling fraction; 

$\bar{Y}_h$ and $\bar{y}_h$ are the population and sample means of $Y$ 
within stratum $h$; 

$S_{yh}$ is the population standard deviation of $Y$ within 
stratum $h$. 

In this paper the strata are constructed using $X$, a 
stratification variable. Stratum $h$ consists of all units with an 
$X$-value in the interval $(b_{h-1}, b_h]$, where 
$-\infty = b_0 < b_1 < \ldots < b_{L-1} < b_L = \infty$ are the stratum boundaries. 

The survey estimator for $\bar{Y}$ can be expressed as 
$\bar{y}_{st} = \sum_h W_h \bar{y}_h$; its variance is given by: 

$$\mathrm{Var}(\bar{y}_{st}) = \sum_{h=1}^{L} W_h^2 \left( \frac{1}{n_h} - \frac{1}{N_h} \right) S_{yh}^2. \tag{2.1}$$

In business surveys, all the big firms are sampled; we 
choose stratum $L$ as the take-all stratum so that $n_L = N_L$. For 
$h < L$, $n_h$, the sample size in take-some stratum $h$, can be 
written as $(n - N_L) a_h$ where $n$ is the total sample size and 
$a_h$ depends on the allocation rule. The two allocation rules 
that are considered in this paper are 

— The power allocation rule 

$$a_h = \frac{(W_h \bar{Y}_h)^p}{\sum_{k=1}^{L-1} (W_k \bar{Y}_k)^p}, \tag{2.2}$$

where $p$ is a positive number in (0, 1]; 

— The Neyman allocation rule 

$$a_h = \frac{W_h S_{yh}}{\sum_{k=1}^{L-1} W_k S_{yk}}. \tag{2.3}$$

Substituting $n_h = (n - N_L) a_h$ into (2.1) and solving for $n$ gives 

$$n = N W_L + \frac{\sum_{h=1}^{L-1} W_h^2 S_{yh}^2 / a_h}{\mathrm{Var}(\bar{y}_{st}) + \sum_{h=1}^{L-1} W_h S_{yh}^2 / N}. \tag{2.4}$$

The optimal stratum boundaries are the values of $b_1, \ldots, b_{L-1}$ 
that minimize $n$ subject to a requirement on the precision of $\bar{y}_{st}$ 
such as $\mathrm{Var}(\bar{y}_{st}) = \bar{Y}^2 c^2$ where $c$ is the target coefficient 
of variation (CV). The range $c$ = 1% to 10% is often used 
for business surveys. 
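
As a rough illustration of (2.2) to (2.4), the following sketch computes the total sample size and the take-some allocations from stratum summaries. The function name `total_sample_size` and its arguments are hypothetical, chosen for this example only; the inputs are the stratum sizes, means and variances printed in the Neyman panel of Table 1 (section 4), and the divisor used for the stratum variances is assumed to match the paper's.

```python
import math

def total_sample_size(N_h, mean_h, var_h, cv, rule="neyman", p=0.7):
    """Total sample size n from (2.4); the last stratum is the take-all stratum."""
    N = sum(N_h)
    W = [Nh / N for Nh in N_h]
    ybar = sum(w * m for w, m in zip(W, mean_h))    # overall mean of Y
    target_var = (cv * ybar) ** 2                   # Var(ybar_st) = Ybar^2 c^2
    take_some = range(len(N_h) - 1)

    if rule == "neyman":                            # allocation rule (2.3)
        raw = [W[h] * math.sqrt(var_h[h]) for h in take_some]
    elif rule == "power":                           # allocation rule (2.2)
        raw = [(W[h] * mean_h[h]) ** p for h in take_some]
    else:
        raise ValueError("rule must be 'neyman' or 'power'")
    a = [r / sum(raw) for r in raw]

    num = sum(W[h] ** 2 * var_h[h] / a[h] for h in take_some)
    den = target_var + sum(W[h] * var_h[h] for h in take_some) / N
    n = N * W[-1] + num / den
    # n_h = (n - N_L) a_h in the take-some strata, N_L in the take-all stratum
    n_h = [(n - N_h[-1]) * a[h] for h in take_some] + [float(N_h[-1])]
    return n, n_h

# Stratum summaries as printed in the Neyman panel of Table 1.
N_h = [87, 81, 65, 46, 5]
mean_h = [878, 1701, 3114, 6921, 28418]
var_h = [57260, 99688, 351547, 3724610, 426851844]
n, n_h = total_sample_size(N_h, mean_h, var_h, cv=0.05)
print(round(n, 1), [round(v) for v in n_h])
```

With these inputs the unrounded n is close to 19 and the rounded allocations are 2, 2, 3, 7 and 5, in line with the Neyman panel of Table 1.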


3. SOME MODELS FOR THE DISCREPANCY 
BETWEEN THE STRATIFICATION AND 
THE SURVEY VARIABLE 


In this section $\{x_i, i = 1, \ldots, N\}$ denotes the known 
stratification variable for the $N$ units in the population. 
Many stratification algorithms, including Lavallée and 
Hidiroglou, suppose that $\{x_i, i = 1, \ldots, N\}$ also represents 
the values of the study variable. This section suggests 
statistical models to account for a difference between these 
two variables. 

For the sequel, it is convenient to look at $X$ and $Y$ as 
continuous random variables and to let $f(x)$, $x \in \mathbb{R}$, denote 
the density of $X$. The data $\{x_i, i = 1, \ldots, N\}$ can be viewed 
as $N$ independent realizations of the random variable $X$. 
Since stratum $h$ consists of the population units with an 
$X$-value in the interval $(b_{h-1}, b_h]$, the stratification process 
uses the values of $E(Y \mid b_h \ge X > b_{h-1})$ and 
$\mathrm{Var}(Y \mid b_h \ge X > b_{h-1})$, the conditional mean and variance of 
$Y$ given that the unit falls in stratum $h$, for $h = 1, \ldots, L-1$. 
Three models for the difference between $X$ and $Y$ are next 
given along with their conditional means and variances for 
$Y$. 


3.1 A Log-linear Model 


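The first model considers that $\log(Y) = \alpha + \beta_{\log} \log(X) + \varepsilon$, 
where $\varepsilon$ is a normal random variable with 
mean 0 and variance $\sigma^2_{\log}$, which is independent of $X$, and 
$\alpha$ and $\beta_{\log}$ are parameters to be determined. When 
$\alpha = 0$, $\beta_{\log} = 1$ and $\sigma_{\log} = 0$, one has $X = Y$; the survey and 
the stratification variables are the same. In general, 
$Y = e^{\alpha} X^{\beta_{\log}} e^{\varepsilon}$. The conditional moments of $Y$ can be 
evaluated using the basic properties of the lognormal 
distribution (see Johnson and Kotz 1970), that is 

$$E(e^{\varepsilon}) = e^{\sigma^2_{\log}/2} \quad \text{and} \quad \mathrm{Var}(e^{\varepsilon}) = e^{\sigma^2_{\log}} (e^{\sigma^2_{\log}} - 1).$$

One has 

$$E(Y \mid b_h \ge X > b_{h-1}) = \exp(\alpha + \sigma^2_{\log}/2) \, E(X^{\beta_{\log}} \mid b_h \ge X > b_{h-1}),$$

while $\mathrm{Var}(Y \mid b_h \ge X > b_{h-1})$ is equal to 

$$\mathrm{Var}(E(Y \mid X) \mid b_h \ge X > b_{h-1}) + E(\mathrm{Var}(Y \mid X) \mid b_h \ge X > b_{h-1})$$
$$= \exp(2\alpha + \sigma^2_{\log}) \{ \mathrm{Var}(X^{\beta_{\log}} \mid b_h \ge X > b_{h-1}) + (e^{\sigma^2_{\log}} - 1) E(X^{2\beta_{\log}} \mid b_h \ge X > b_{h-1}) \}$$
$$= \exp(2\alpha + \sigma^2_{\log}) \{ e^{\sigma^2_{\log}} E(X^{2\beta_{\log}} \mid b_h \ge X > b_{h-1}) - E(X^{\beta_{\log}} \mid b_h \ge X > b_{h-1})^2 \}.$$

The parameter values $\beta_{\log}$ and $\sigma_{\log}$ can sometimes be 
calculated from historical data. Simple ad hoc values are 
$\beta_{\log} = 1$ and $\sigma^2_{\log} = (1 - \rho^2) \mathrm{Var}(\log(X))$. Here $\rho$ is the 
assumed correlation between $\log(X)$ and $\log(Y)$. It can be 
set equal to predetermined values such as 0.95 or 0.99. 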
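
The conditional moments of this section can be evaluated from the population x-values as in the following sketch. It is only an illustration under stated assumptions: the function names `loglinear_stratum_moments` and `adhoc_sigma2` are hypothetical, the conditional moments of X are replaced by their empirical counterparts within the stratum, and the variance of log(X) uses the divisor N.

```python
import math

def loglinear_stratum_moments(x, lo, hi, alpha, beta, sigma2):
    """E(Y | lo < X <= hi) and Var(Y | lo < X <= hi) under the log-linear model."""
    xs = [v for v in x if lo < v <= hi]
    m1 = sum(v ** beta for v in xs) / len(xs)          # E(X^beta | stratum)
    m2 = sum(v ** (2 * beta) for v in xs) / len(xs)    # E(X^(2 beta) | stratum)
    mean_y = math.exp(alpha + sigma2 / 2) * m1
    var_y = math.exp(2 * alpha + sigma2) * (math.exp(sigma2) * m2 - m1 ** 2)
    return mean_y, var_y

def adhoc_sigma2(x, rho):
    """Ad hoc value sigma2_log = (1 - rho^2) Var(log X) for an assumed correlation rho."""
    logs = [math.log(v) for v in x]
    mu = sum(logs) / len(logs)
    var_log = sum((l - mu) ** 2 for l in logs) / len(logs)
    return (1 - rho ** 2) * var_log

x = [1.0, 2.0, 3.0, 4.0]
# With alpha = 0, beta = 1, sigma2 = 0 the model reduces to Y = X.
print(loglinear_stratum_moments(x, 0, 10, 0.0, 1.0, 0.0))   # (2.5, 1.25)
print(loglinear_stratum_moments(x, 0, 10, 0.0, 1.0, adhoc_sigma2(x, 0.95)))
```

The first call returns the stratum mean and variance of X itself, which is the special case X = Y; the second shows how a correlation below 1 inflates the conditional variance.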


3.2 A Linear Model 


In the survey sampling literature, the discrepancy 
between $Y$ and $X$ is often modeled with a heteroscedastic 
linear model, 

$$Y = \beta_{lin} X + \varepsilon, \tag{3.5}$$

where the conditional distribution of $\varepsilon$, given $X$, has mean 
0 and variance $\sigma^2_{lin} X^{\gamma}$, for some non negative parameter $\gamma$. 
Straightforward calculations lead to 
$E(Y \mid b_h \ge X > b_{h-1}) = \beta_{lin} E(X \mid b_h \ge X > b_{h-1})$ while 
$\mathrm{Var}(Y \mid b_h \ge X > b_{h-1}) = \beta^2_{lin} \{ \mathrm{Var}(X \mid b_h \ge X > b_{h-1}) + (\sigma_{lin}/\beta_{lin})^2 E(X^{\gamma} \mid b_h \ge X > b_{h-1}) \}$. 

For an arbitrary $\gamma > 0$, the conditional variance of $Y$ 
depends on three conditional moments of $X$. The 
generalization of Sethi's algorithm presented in section 5 does not 
work in this situation. Note however that when $\gamma = 2$, the 
conditional mean and variance of $Y$ are proportional to 
those for the log-linear model with 

$$\beta_{\log} = 1 \quad \text{and} \quad \sigma^2_{\log} = \log(1 + \sigma^2_{lin}/\beta^2_{lin});$$

the proportionality factors are $\exp(\alpha + \sigma^2_{\log}/2)/\beta_{lin}$ and 
$\exp(2\alpha + \sigma^2_{\log})/\beta^2_{lin}$ for the conditional expectations and the 
conditional variances respectively. Thus the two models for 
the discrepancy between the stratification and the survey 
variable, either the log-linear model of section 3.1 or the 
linear model (3.5) with parameter $\gamma = 2$, lead, in section 5, 
to the same stratified design provided that the parameter relation 
displayed above holds. In the 
later sections, the log-linear model is used to represent the 
change between $X$ and $Y$. It should give good results when 
the true relationship between $Y$ and $X$ is modeled by (3.5) 
with $\gamma = 2$. When model (3.5) is assumed to hold with a 
smaller value of $\gamma$, the algorithm of section 5 can still be 
implemented when $\gamma$ is set to either 0 or 1. This is however 
not pursued in this paper. 
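
The proportionality between the linear model with γ = 2 and the log-linear model with slope 1 can be checked numerically. The sketch below is an illustration only; the function names are hypothetical, and `m1` and `m2` stand for the conditional moments E(X | stratum) and E(X² | stratum), which are taken as given.

```python
import math

def linear_moments(m1, m2, beta_lin, sigma_lin):
    """Conditional mean and variance of Y under model (3.5) with gamma = 2."""
    mean_y = beta_lin * m1
    var_y = beta_lin ** 2 * ((m2 - m1 ** 2) + (sigma_lin / beta_lin) ** 2 * m2)
    return mean_y, var_y

def loglinear_moments(m1, m2, alpha, sigma2_log):
    """Conditional mean and variance of Y under the log-linear model with slope 1."""
    mean_y = math.exp(alpha + sigma2_log / 2) * m1
    var_y = math.exp(2 * alpha + sigma2_log) * (math.exp(sigma2_log) * m2 - m1 ** 2)
    return mean_y, var_y

def matching_sigma2_log(beta_lin, sigma_lin):
    """sigma2_log = log(1 + sigma_lin^2 / beta_lin^2)."""
    return math.log(1 + (sigma_lin / beta_lin) ** 2)

m1, m2 = 2.5, 7.5                  # illustrative conditional moments of X
beta_lin, sigma_lin, alpha = 1.2, 0.3, 0.1
s2 = matching_sigma2_log(beta_lin, sigma_lin)
mean_lin, var_lin = linear_moments(m1, m2, beta_lin, sigma_lin)
mean_log, var_log = loglinear_moments(m1, m2, alpha, s2)
# Both ratios should equal the proportionality factors given in the text.
print(mean_log / mean_lin, math.exp(alpha + s2 / 2) / beta_lin)
print(var_log / var_lin, math.exp(2 * alpha + s2) / beta_lin ** 2)
```

Because the factors do not depend on the stratum, the squared coefficient of variation of Y within each stratum is identical under the two models, which is why they lead to the same design.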


3.3 A Random Replacement Model 


This model assumes that the stratification variable is 
equal to the survey variable, i.e., $X = Y$, for most units. 
There is however a small probability $\varepsilon$ that a unit changed 
drastically; its $Y$ value then has $f(x)$ as density and is 
distributed independently of its $X$ value. This is the approach 
used in Rivest (1999) to model the occurrence of stratum 
jumpers for which $X$ is not representative of $Y$. More 
formally, this can be written as, 

$$Y = \begin{cases} X & \text{with probability } 1 - \varepsilon \\ X_{new} & \text{with probability } \varepsilon \end{cases} \tag{3.6}$$

where $X_{new}$ represents a random variable with density $f(x)$ 
distributed independently of $X$. The conditional mean for $Y$ 
under this model is given by 

$$E(Y \mid b_h \ge X > b_{h-1}) = (1 - \varepsilon) E(X \mid b_h \ge X > b_{h-1}) + \varepsilon E(X),$$

while its conditional variance is equal to 

$$\mathrm{Var}(Y \mid b_h \ge X > b_{h-1}) = (1 - \varepsilon) E(X^2 \mid b_h \ge X > b_{h-1}) + \varepsilon E(X^2) - \{ (1 - \varepsilon) E(X \mid b_h \ge X > b_{h-1}) + \varepsilon E(X) \}^2.$$
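
A minimal sketch of these two conditional moments, with the moments of X estimated from the population x-values, is given below. The function name `random_replacement_moments` is hypothetical and the half-open stratum (lo, hi] follows the convention of section 2.

```python
def random_replacement_moments(x, lo, hi, eps):
    """E(Y | lo < X <= hi) and Var(Y | lo < X <= hi) under random replacement."""
    xs = [v for v in x if lo < v <= hi]
    e1_h = sum(xs) / len(xs)                    # E(X   | stratum)
    e2_h = sum(v * v for v in xs) / len(xs)     # E(X^2 | stratum)
    e1 = sum(x) / len(x)                        # E(X)
    e2 = sum(v * v for v in x) / len(x)         # E(X^2)
    mean_y = (1 - eps) * e1_h + eps * e1
    var_y = (1 - eps) * e2_h + eps * e2 - mean_y ** 2
    return mean_y, var_y

x = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0]
print(random_replacement_moments(x, 0, 4, 0.0))    # stratum moments of X itself
print(random_replacement_moments(x, 0, 4, 0.02))   # 2% of units replaced
```

Setting the replacement probability to 0 recovers the within-stratum moments of X, while a value of 1 gives the unconditional moments; even a small probability raises the variance noticeably in a stratum of small units, since a replaced unit can come from anywhere in the skewed population.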


4. AN EXAMPLE 


Before addressing the technical details underlying the 
construction of the algorithms, it is convenient to look at an 
example. Consider the MU284 population of Särndal, 
Swensson and Wretman (1992), presenting data on 284 
Swedish municipalities. 

To build a stratified design for estimating the average of 
RMT85, the revenues from the 1985 municipal taxation, 
REV84, the real estate value according to the 1984 assessment, 
is used as a stratification variable. One takes $L = 5$ and sets 
the target CV at 5%. Two stratified designs obtained with 
the Lavallée and Hidiroglou algorithm are given in Table 1, 
for the power allocation with $p = 0.7$ and the Neyman 
allocation. Both have $n = 19$. When applied to the survey 
variable RMT85, these two designs give estimators of total 
revenue with coefficients of variation of 8.3% and 7.3% 
respectively. Failing to account for a change between the 
survey and the stratification variables yields estimators that 
are more variable than expected. 


Table 1 
Stratified designs obtained with the Lavallée and 
Hidiroglou algorithm for the MU284 population using 
REV84 as stratification variable and a target CV of 5% 


Power allocation with p = 0.7 

            b_h      mean     variance      N_h   n_h   f_h    n 
stratum 1    1,251      874        56,250    86     1   0.01   19 
stratum 2    2,352    1,696       100,898    82     2   0.02   19 
stratum 3    4,603    3,114       351,547    65     3   0.05   19 
stratum 4   10,606    6,442     2,027,436    41     3   0.07   19 
stratum 5   59,878   19,631   275,502,518    10    10   1      19 

Neyman allocation 

            b_h      mean     variance      N_h   n_h   f_h    n 
stratum 1    1,273      878        57,260    87     2   0.02   19 
stratum 2    2,336    1,701        99,688    81     2   0.02   19 
stratum 3    4,619    3,114       351,547    65     3   0.05   19 
stratum 4   11,776    6,921     3,724,610    46     7   0.15   19 
stratum 5   59,878   28,418   426,851,844     5     5   1      19 
To model the discrepancy between REV84 and RMT85, 
we use the log-linear model of section 3.1. There are 
outliers in the linear regression of log(RMT85) on 
log(REV84); they make the least squares estimates of $\beta_{\log}$ 
and $\sigma_{\log}$ unrepresentative of the relationship between the 
two variables. Robust estimates obtained with the Splus 
function lmRobMM are used instead. They are given by 
$\beta_{\log} = 1.1$ and $\sigma^2_{\log} = 0.2116$. Table 2 gives the stratified 
designs obtained with the generalized Lavallée and 
Hidiroglou algorithm for two allocation rules. They both 
give estimators of the total of RMT85 having a CV of 5.7%. 
This CV is still larger than 5%. Since there are outliers in 
the log-linear regression, the assumption of normal errors 
made in section 3.1 is not met. This might explain the 
failure to reach the target CV exactly. The increase in 
sample size from $n = 19$ to $n = 28$ is noteworthy. For both 
allocation methods the design obtained using the log-linear 
model has smaller take-all strata than Lavallée and 
Hidiroglou. 


Table 2 
Stratified designs obtained with the generalized Lavallée and 
Hidiroglou algorithm for the MU284 population using REV84 as 
stratification variable, a log-linear model with $\beta_{\log} = 1.1$ and 
$\sigma^2_{\log} = 0.2116$ for the discrepancy between REV84 and RMT85, 
and a target CV of 5% 


Log-linear model stratification algorithm with power allocation 
with p = 0.7 

            b_h      mean     variance      N_h   n_h   f_h    n 
stratum 1    1,558    1,023        91,245   121     4   0.03   28 
stratum 2    3,031    2,219       168,204    81     5   0.06   28 
stratum 3    5,706    4,022       464,471    44     6   0.14   28 
stratum 4   11,107    7,602     2,659,106    32     7   0.22   28 
stratum 5   59,878   25,536    39,131,413     6     6   1      28 


Log-linear model stratification algorithm with Neyman 
allocation 

            b_h      mean     variance      N_h   n_h   f_h    n 
stratum 1    1,582    1,023        91,245   121     4   0.03   28 
stratum 2    3,040    2,219       168,204    81     5   0.06   28 
stratum 3    5,608    4,022       464,471    44     5   0.11   28 
stratum 4   11,476    7,709     2,952,313    33     9   0.27   28 
stratum 5   59,878   28,418   426,851,844     5     5   1      28 


An alternative to the generalized Lavallée and 
Hidiroglou algorithm for the construction of stratified 
designs is to use their original algorithm with a smaller target 
CV. This increases the sample size, thereby reducing the 
variance of the estimator of the total of the survey variable. 
When constructing a design for RMT85 using REV84 as a 
stratification variable, the standard Lavallée and Hidiroglou 
algorithm with power allocation rule ($p = 0.7$) and a target 
CV of 3.6% yields a stratified design with $n = 28$. This 
design has the same sample size as those presented in Table 
2. The CV of the estimator of the total RMT85 is 5.7%, the 
same as the CVs obtained with the designs of Table 2. The 
main difference between these designs is the size of the 
take-all stratum. The design constructed with the Lavallée 
and Hidiroglou algorithm has a take-all stratum of size 
$N_L = 13$ as compared to $N_L = 5$ and $N_L = 6$ for the designs 
of Table 2. Allowing the stratification and the survey 
variables to differ appears to reduce the relative importance 
of the take-all stratum in the sampling design. Further 
investigations are needed to verify this hypothesis. 

The stratification algorithm for the random replacement 
model of section 3.3 (with Neyman allocation) was also 
applied to REV84. Assuming changes in 2% of the units 
($\varepsilon = 0.02$), the generalized Lavallée and Hidiroglou 
algorithm yields a stratified design with $n = 37$ sample units; 
the resulting estimator of total RMT85 has a CV of 5.5%. 
An interesting property of this stratified design is that the 
smallest sampling fraction is $\min_h f_h = 9.3\%$; it is much 
larger than $\min_h f_h$ for the designs of Tables 1 and 2. 
Despite the presence of outliers, the random replacement 
model does not describe the changes between REV84 and 
RMT85 as well as the log-linear model. This explains why 
a larger sample size, 37 instead of 28, is needed to get an 
estimator with a variance comparable to that obtained with 
the stratification based on a log-linear model. 


5. A METHOD FOR CONSTRUCTING 
STRATIFICATION ALGORITHMS 


The aim of a stratification algorithm is to determine the 
optimal stratum boundaries and sample sizes for sampling 
$Y$ using the known values $\{x_i; i = 1, \ldots, N\}$ of variable $X$ for 
all the units in the population. A model, such as those given 
in section 3, characterizes the relationship between $X$ and $Y$. 
This section extends the stratification algorithm of Lavallée 
and Hidiroglou (1988) to situations where $X$ and $Y$ differ. It 
uses the log-linear model of section 3.1 to account for the 
differences between $Y$ and $X$. Modifications to handle the 
random replacement model are easily carried out (see 
Rivest 1999). 


5.1 A Generalization of Sethi’s (1963) Stratification 
Method 


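It is convenient to consider an infinite population 
analogue to equation (2.4) for $n$. Since the random variable $X$ 
has a density $f(x)$, the first two conditional moments of $Y$ 
given that $b_{h-1} < X \le b_h$ can be written in terms of 

$$W_h = \int_{b_{h-1}}^{b_h} f(x)\,dx, \quad \phi_h = \int_{b_{h-1}}^{b_h} x^{\beta} f(x)\,dx, \quad \text{and} \quad \psi_h = \int_{b_{h-1}}^{b_h} x^{2\beta} f(x)\,dx,$$

where $\beta$ is the slope of the log-linear model given in section 
3.1 (in this section $\beta$ and $\sigma$ represent parameters of the 
log-linear model of section 3.1; since there is no risk of 
confusion the subscript log is not used anymore). For 
stratification purposes, it is useful to rewrite (2.4) in terms 
of the conditional means and variances for $Y$, 

$$n = N W_L + \frac{\sum_{h=1}^{L-1} W_h^2 \, \mathrm{Var}(Y \mid b_h \ge X > b_{h-1}) / a_{h,X}}{\bar{Y}^2 c^2 + \sum_{h=1}^{L-1} W_h \, \mathrm{Var}(Y \mid b_h \ge X > b_{h-1}) / N} \tag{5.7}$$

where $a_{h,X}$ denotes the allocation rule written in terms of 
the known $X$. For instance, under power allocation, 

$$a_{h,X} = \frac{\{ W_h E(Y \mid b_h \ge X > b_{h-1}) \}^p}{\sum_{k=1}^{L-1} \{ W_k E(Y \mid b_k \ge X > b_{k-1}) \}^p}$$

for $h = 1, \ldots, L-1$. Given a model for the relationship 
between $Y$ and $X$, $\mathrm{Var}(Y \mid b_h \ge X > b_{h-1})$ and 
$E(Y \mid b_h \ge X > b_{h-1})$ can be written in terms of $W_h$, $\phi_h$, and 
$\psi_h$. Thus, the partial derivatives of $n$ with respect to $b_h$ can 
be evaluated, for $h < L-1$, using the chain rule, 

$$\frac{\partial n}{\partial b_h} = \frac{\partial n}{\partial W_h} \frac{\partial W_h}{\partial b_h} + \frac{\partial n}{\partial \phi_h} \frac{\partial \phi_h}{\partial b_h} + \frac{\partial n}{\partial \psi_h} \frac{\partial \psi_h}{\partial b_h} + \frac{\partial n}{\partial W_{h+1}} \frac{\partial W_{h+1}}{\partial b_h} + \frac{\partial n}{\partial \phi_{h+1}} \frac{\partial \phi_{h+1}}{\partial b_h} + \frac{\partial n}{\partial \psi_{h+1}} \frac{\partial \psi_{h+1}}{\partial b_h}.$$

Observe that 

$$\frac{\partial W_h}{\partial b_h} = -\frac{\partial W_{h+1}}{\partial b_h} = f(b_h), \quad \frac{\partial \phi_h}{\partial b_h} = -\frac{\partial \phi_{h+1}}{\partial b_h} = b_h^{\beta} f(b_h), \quad \frac{\partial \psi_h}{\partial b_h} = -\frac{\partial \psi_{h+1}}{\partial b_h} = b_h^{2\beta} f(b_h).$$

This leads to the following result, for $h < L-1$, 

$$\frac{\partial n}{\partial b_h} = f(b_h) \left\{ \left( \frac{\partial n}{\partial W_h} - \frac{\partial n}{\partial W_{h+1}} \right) + \left( \frac{\partial n}{\partial \phi_h} - \frac{\partial n}{\partial \phi_{h+1}} \right) b_h^{\beta} + \left( \frac{\partial n}{\partial \psi_h} - \frac{\partial n}{\partial \psi_{h+1}} \right) b_h^{2\beta} \right\}.$$

Similarly, 

$$\frac{\partial n}{\partial b_{L-1}} = f(b_{L-1}) \left\{ \left( \frac{\partial n}{\partial W_{L-1}} - N \right) + \frac{\partial n}{\partial \phi_{L-1}} b_{L-1}^{\beta} + \frac{\partial n}{\partial \psi_{L-1}} b_{L-1}^{2\beta} \right\}.$$

Sethi's (1963) algorithm is used to solve $\partial n / \partial b_h = 0$. 
It considers that the partial derivatives are proportional to 
quadratic functions in $b_h^{\beta}$. The updated value for $b_h$ is 
given by the largest root of the corresponding quadratic 
function. When $h < L-1$, this gives 

$$(b_h^{new})^{\beta} = \frac{ -\left( \frac{\partial n}{\partial \phi_h} - \frac{\partial n}{\partial \phi_{h+1}} \right) + \left\{ \left( \frac{\partial n}{\partial \phi_h} - \frac{\partial n}{\partial \phi_{h+1}} \right)^2 - 4 \left( \frac{\partial n}{\partial \psi_h} - \frac{\partial n}{\partial \psi_{h+1}} \right) \left( \frac{\partial n}{\partial W_h} - \frac{\partial n}{\partial W_{h+1}} \right) \right\}^{1/2} }{ 2 \left( \frac{\partial n}{\partial \psi_h} - \frac{\partial n}{\partial \psi_{h+1}} \right) },$$

while for $h = L-1$ we have 

$$(b_{L-1}^{new})^{\beta} = \frac{ -\frac{\partial n}{\partial \phi_{L-1}} + \left\{ \left( \frac{\partial n}{\partial \phi_{L-1}} \right)^2 - 4 \frac{\partial n}{\partial \psi_{L-1}} \left( \frac{\partial n}{\partial W_{L-1}} - N \right) \right\}^{1/2} }{ 2 \frac{\partial n}{\partial \psi_{L-1}} }.$$

The partial derivatives of $n$ with respect to $W_h$, $\phi_h$, and $\psi_h$ 
depend on moments of order 0, 1, and 2 of $x^{\beta}$ within 
stratum $h$. These moments are evaluated using the $N$ 
x-values in the population. For instance, 

$$\phi_h \approx \frac{1}{N} \sum_{i:\, b_{h-1} < x_i \le b_h} x_i^{\beta}.$$

Applications of this general method are provided next. 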

When using Sethi's algorithm, one typically has $L \ge 3$. 
Note however that it also works when $L = 2$. In this case, 
the algorithm is searching for the boundary between a 
take-all and a take-some stratum. Successive evaluations of $b_1^{new}$ 
presented above yield an optimal boundary. When one 
assumes that the stratification and the study variable 
coincide, i.e., $X = Y$, this boundary is nearly identical to 
that obtained with the algorithm presented in Hidiroglou 
(1986). 
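
The two ingredients of one boundary update can be sketched as follows: the empirical stratum moments of order 0, 1 and 2 of x^β, and the largest root of the quadratic in b^β. This is an illustration under stated assumptions, not the author's SAS IML implementation; the function names are hypothetical, and the arguments `d_W`, `d_phi` and `d_psi` are assumed to hold the differences of partial derivatives between strata h and h+1 (for h = L−1, the partial derivatives themselves, with N subtracted from the one with respect to W).

```python
import math

def stratum_moments(x, bounds, beta):
    """Empirical W_h, phi_h, psi_h for the strata (b_{h-1}, b_h] defined by bounds."""
    N = len(x)
    edges = [-math.inf] + list(bounds) + [math.inf]
    W, phi, psi = [], [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        xs = [v for v in x if lo < v <= hi]
        W.append(len(xs) / N)                           # moment of order 0
        phi.append(sum(v ** beta for v in xs) / N)      # moment of order 1 of x^beta
        psi.append(sum(v ** (2 * beta) for v in xs) / N)  # moment of order 2 of x^beta
    return W, phi, psi

def sethi_update(d_W, d_phi, d_psi, beta):
    """Largest root u of d_psi*u^2 + d_phi*u + d_W = 0, returned as b = u^(1/beta)."""
    disc = d_phi ** 2 - 4 * d_psi * d_W
    if d_psi == 0 or disc < 0:
        raise ValueError("the quadratic has no usable real root")
    roots = [(-d_phi + s * math.sqrt(disc)) / (2 * d_psi) for s in (1, -1)]
    u = max(roots)
    if u <= 0:
        raise ValueError("the largest root is not positive")
    return u ** (1 / beta)

print(stratum_moments([1.0, 2.0, 3.0, 4.0], [2.0], 1.0))
print(sethi_update(6.0, -5.0, 1.0, 1.0))   # roots of u^2 - 5u + 6 are 2 and 3
```

In a full implementation the update would be applied to every boundary in turn, the moments recomputed, and the cycle repeated until the boundaries stop changing.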


5.2 An Algorithm for Power Allocation 
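For the log-linear model of section 3.1, the conditional 
expectation is $E(Y \mid b_h \ge X > b_{h-1}) = C \phi_h / W_h$, while the 
conditional variance is 

$$\mathrm{Var}(Y \mid b_h \ge X > b_{h-1}) = C^2 \{ e^{\sigma^2} \psi_h / W_h - (\phi_h / W_h)^2 \},$$

where $C = \exp(\alpha + \sigma^2/2)$. Under the power allocation rule, 
$a_{h,X} = \phi_h^p / \sum_{k=1}^{L-1} \phi_k^p$, and formula (5.7) for $n$ becomes 

$$n = N W_L + \frac{ \left[ \sum_{k=1}^{L-1} \phi_k^p \right] \left[ \sum_{h=1}^{L-1} (e^{\sigma^2} W_h \psi_h - \phi_h^2) / \phi_h^p \right] }{ \left( \sum_{i} x_i^{\beta} / N \right)^2 c^2 + \sum_{h=1}^{L-1} (e^{\sigma^2} \psi_h - \phi_h^2 / W_h) / N }.$$

The partial derivatives needed to implement the 
stratification algorithm are easily calculated; for $h \le L-1$, 

$$\frac{\partial n}{\partial W_h} = \frac{A e^{\sigma^2} \psi_h / \phi_h^p}{F} - \frac{A B (\phi_h / W_h)^2 / N}{F^2},$$

$$\frac{\partial n}{\partial \phi_h} = \frac{A \{ -p (e^{\sigma^2} W_h \psi_h - \phi_h^2) / \phi_h^{p+1} - 2 / \phi_h^{p-1} \} + p \phi_h^{p-1} B}{F} + \frac{2 A B \phi_h / (N W_h)}{F^2},$$

$$\frac{\partial n}{\partial \psi_h} = \frac{e^{\sigma^2} A W_h / \phi_h^p}{F} - \frac{e^{\sigma^2} A B / N}{F^2},$$

where 

$$A = \sum_{h=1}^{L-1} \phi_h^p, \quad B = \sum_{h=1}^{L-1} (e^{\sigma^2} W_h \psi_h - \phi_h^2) / \phi_h^p,$$

and 

$$F = \left( \sum_{i} x_i^{\beta} / N \right)^2 c^2 + \sum_{h=1}^{L-1} (e^{\sigma^2} \psi_h - \phi_h^2 / W_h) / N.$$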

5.3 An algorithm for Neyman allocation 


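Under Neyman allocation, allocation rule (2.3) written in 
terms of $W_h$, $\phi_h$, and $\psi_h$ is 

$$a_{h,X} = \frac{ \{ e^{\sigma^2} \psi_h W_h - \phi_h^2 \}^{1/2} }{ \sum_{k=1}^{L-1} \{ e^{\sigma^2} \psi_k W_k - \phi_k^2 \}^{1/2} }$$

and the formula for $n$ is 

$$n = N W_L + \frac{ \left[ \sum_{h=1}^{L-1} \{ e^{\sigma^2} \psi_h W_h - \phi_h^2 \}^{1/2} \right]^2 }{ \left( \sum_{i} x_i^{\beta} / N \right)^2 c^2 + \sum_{h=1}^{L-1} (e^{\sigma^2} \psi_h - \phi_h^2 / W_h) / N }.$$

The partial derivatives needed to implement Sethi's 
(1963) iterative algorithm are 

$$\frac{\partial n}{\partial W_h} = \frac{ e^{\sigma^2} A \psi_h }{ F \{ e^{\sigma^2} \psi_h W_h - \phi_h^2 \}^{1/2} } - \frac{ A^2 (\phi_h / W_h)^2 / N }{ F^2 },$$

$$\frac{\partial n}{\partial \phi_h} = - \frac{ 2 A \phi_h }{ F \{ e^{\sigma^2} \psi_h W_h - \phi_h^2 \}^{1/2} } + \frac{ 2 A^2 \phi_h / (N W_h) }{ F^2 },$$

$$\frac{\partial n}{\partial \psi_h} = \frac{ e^{\sigma^2} A W_h }{ F \{ e^{\sigma^2} \psi_h W_h - \phi_h^2 \}^{1/2} } - \frac{ e^{\sigma^2} A^2 / N }{ F^2 },$$

where 

$$A = \sum_{h=1}^{L-1} \{ e^{\sigma^2} \psi_h W_h - \phi_h^2 \}^{1/2}$$

and 

$$F = \left( \sum_{i} x_i^{\beta} / N \right)^2 c^2 + \sum_{h=1}^{L-1} (e^{\sigma^2} \psi_h - \phi_h^2 / W_h) / N.$$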

6. NUMERICAL CONSIDERATIONS 


Slanta and Krenzke (1994, 1996) encountered numerical 
difficulties when using the Lavallée and Hidiroglou algo- 
rithm with Neyman allocation: convergence was slow and 
sometimes the algorithm did not converge to the true 
minimum value for $n$. Indeed Schneeberger (1979) and Slanta 
and Krenzke (1994) showed that, for a particular bimodal 
population, the problem has a saddle; that is, the partial 
derivatives are all null at boundaries $b_h$ which do not give a 
true minimum for $n$. 

When using the algorithms constructed in this paper, we 
also experienced the numerical difficulties alluded to in 
Slanta and Krenzke (1994). The algorithms constructed 
under power allocation were generally more stable than 
those using Neyman allocation; numerical difficulties were 
more frequent when the number $L$ of strata was large. 
Furthermore, as the distribution for $Y$ moved away from that 
of $X$, i.e., as $\sigma^2$ increased, non convergence of the algorithm 
and failure to reach the global minimum for $n$ were more 
frequent. In these situations, the stratification algorithm's 
starting values were of paramount importance. For instance, 
in Table 2, the design accounting for changes between $Y$ 
and $X$ obtained under Neyman allocation depends heavily 
on the starting values. The one presented in Table 2 uses the 
boundaries presented in Table 2 for the power allocation as 
starting values. Starting the algorithm with the boundaries 
obtained in Table 1 for the Lavallée and Hidiroglou algorithm 
with Neyman allocation yields a different sampling design 
having $n = 29$. 

A good numerical strategy is to run the stratification 
algorithm for several intermediate designs to get to a final 
sampling design, with the stratum boundaries obtained at 
one step used as starting values for the algorithm at the next 
step. The log-linear algorithm is always run in two steps; 
first run the Lavallée and Hidiroglou algorithm, setting 
$\sigma = 0$, and use these boundaries as starting values for the 
algorithm with a non null $\sigma$. Also use as starting values for 
Neyman allocation the corresponding boundaries found 
under power allocation with a $p$ value around 0.7. 
7. CONCLUSION 


This paper has proposed generalizations of the Lavallée 
and Hidiroglou stratification algorithm that account for a 
difference between the stratification and the survey 
variables. Two statistical models have been introduced for 
this purpose. The new class of algorithms uses the Chain 
Rule to derive partial derivatives and Sethi’s (1963) 
technique to find the optimal stratum boundaries. 

The log-linear model stratification algorithm proposed in 
this paper was used successfully in several surveys designed 
at the Statistical Consulting Unit of Université Laval. For 
estimating total maple syrup production in a year, the 
number of sap producing maples for a producer was a con- 
venient size variable. Historical data was used to estimate 
the parameters of the log-linear model linking sap pro- 
ducing maples and production volume. Another example is 
the estimation of the total maintenance deficit of hospital 
buildings in Quebec. The value of each building was the 
known stratification variable. The maintenance deficit was 
estimated to be in the range (20%, 40%) by experts. Solving 
$4\sigma_{\log} = \log(40\%) - \log(20\%)$ gives $\sigma_{\log} = \log(2)/4 = 0.17$ 
as a possible parameter value for the log-linear model of 
section 3.1. In these two examples accounting for changes 
between the stratification and the survey variables increased 
the sample size n by a fair percentage and yielded survey 
estimators whose estimated CVs were close to the target 
CVs. 

Two SAS IML functions implementing the algorithm 
presented in this paper, for power and Neyman allocation, 
are available on the author's website at 
http://www.mat.ulaval.ca/pages/lpr/. They allow user specified starting 
values for the stratum boundaries; they can be used to 
implement the numerical strategies presented in section 6. 
Multi-way Stratification by Linear Programming Made Practical 


WILSON LU and RANDY R. SITTER¹ 


ABSTRACT 


Sitter and Skinner (1994) present a method which applies linear programming to designing surveys with multi-way 
stratification, primarily in situations where the desired sample size is less than or only slightly larger than the total number 
of stratification cells. The idea in their approach is simple, easily understood and easy to apply. However, the main practical 
constraint of their approach is that it rapidly becomes expensive in terms of magnitude of computation as the number of cells 
in the multi-way stratification increases, to the extent that it cannot be used in most realistic situations. In this article, we 
extend this linear programming approach and develop methods to reduce the amount of computation so that very large 
problems become feasible. 


KEY WORDS: PPS sampling; Proportional allocation; Random grouping; Survey sampling. 


1. INTRODUCTION 


In many practical survey situations, there are multiple 
stratifying variables available and thus the designer has the 
option of defining strata as cells formed as cross-classified 
categories of these variables. For examples, see Engle, 
Marsden and Pollock (1971), Hess, Riedel and Fitzpatrick 
(1976), Vihma (1981) and Skinner, Holmes and Holt 
(1994). This multi-way stratification often leads to 
situations where the desired sample size is less than or only 
slightly larger than the total number of stratification cells 
(particularly common when choosing primary sampling 
units (psu’s) in stratified multi-stage designs) and hence 
conventional methods of sample allocation to strata may not 
be applicable. 

An illustration, based on a hypothetical example of 
Bryant, Hartley and Jessen (1960), is given in Table 1. 
Communities (psu’s) are classified by two stratifying 
factors, type and region, with three and five categories 
respectively. The desired sample size of n = 10 is less than 
the total number of cells, 15. This example also illustrates 
a related problem. The entries in Table 1 are the expected 
counts under proportional stratification, i.e., the strata 
sample sizes are proportional to the population strata sizes. 
Under the sample size restrictions, the expected cell sample 
counts will not generally be integers. In cases with very 
small expected counts, rounding to integers will not lead to 
good choices while causing a serious violation of the 
property of proportional allocation. Non-integer margin 
totals are also typical and can cause their own difficulties. 
Goodman and Kish (1950) were the first to address this 
problem under the name of controlled selection, where they 
propose a sampling selection procedure which can be 
classified as random systematic sampling (see Hess, Riedel 
and Fitzpatrick 1976; Waterton 1983). Bryant et al. (1960) 
presented a very simple method to randomly assign sample 
sizes for each cell in two-way stratification and gave two 
estimators based on that sampling scheme. However, since 
the expected cell sample sizes did not include information on 
the proportion of each cell (i.e., the method is not a proper 
controlled selection technique, as only the probabilities of 
the marginal distributions are respected), these estimators 
may not have satisfactory MSE properties (see Sitter and 
Skinner 1994). Jessen (1970) points out that a further 
limitation of the method of Bryant et al. (1960) is that it is 
not possible to constrain specified cell sizes to be zero, 
which may be desired in some situations (see related 
methods under the label “lattice sampling”, e.g. Jessen 
1973, 1975). He proposes two methods for both two-way 
and three-way stratification but both methods are fairly 
complicated to implement and, as noted by Causey, Cox 
and Ernst (1985), may not lead to a solution. Inspired by the 
idea of Rao and Nigam (1990, 1992) in the context of 
avoiding undesirable samples (see also Lahiri and Mukerjee 
2000), Sitter and Skinner (1994) proposed a linear pro- 
gramming approach which attempts to take advantage of 
the power of modern computing. This linear programming 
technique is simple in conception, is flexible to different 
situations, always has a solution and has better MSE 
properties. Its main practical constraint is that it becomes 
computationally intensive as the number of cells in the 
multi-way stratification increases, quickly to the point of 
infeasibility. In this paper we will present a simple method 
which will allow the linear programming technique to 
handle much larger problems. In section 2 we describe the 
linear programming method of Sitter and Skinner (1994) to 
introduce notation and briefly discuss its numerical 
limitations. In section 3.1, we first discuss some simple 
strategies to reduce the computational intensity of the method as 
motivation for the eventual proposal. In sections 3.2 and 3.3 
we discuss the proposed method assuming integer margins 
and give some examples with 80 to 300 stratification 
cells to illustrate the ability of the new methodology to 
handle large problems. In section 3.4, we describe the 
simple extension of the method to non-integer margins and 
illustrate by applying the method to a real example from the 
occupational health literature (Vihma 1981). 


¹ Wilson Lu, Doctoral Student, Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada V5A 1S6; Randy R. Sitter, 


Table 1 
Example from Bryant et al. (1960). Expected Sample 
Cell Counts Under Proportional Stratification (n = 10) 


                  Type of Community 
Region    Urban    Rural    Metropolitan    Total 

1          1.0      0.5         0.5           2.0 
2          0.2      0.3         0.5           1.0 
3          0.2      0.6         1.2           2.0 
4          0.6      1.8         0.6           3.0 
5          1.0      0.8         0.2           2.0 
Total      3.0      4.0         3.0          10.0 


2. THE LINEAR PROGRAMMING TECHNIQUE 


2.1 The Basic Ideas 


We introduce the linear programming method of Sitter 
and Skinner (1994) by considering the simplest kind of 
two-way stratification. Suppose that $N$ units of a finite 
population are arranged in a two-way classification in $R$ rows 
formed by categories of one variable and $C$ columns by 
categories of another. Let $N_{ij}$ denote the number of 
population units in the $i$-th row and the $j$-th column (i.e., in the 
$ij$-th cell) of the two-way table and $P_{ij} = N_{ij}/N$ denote the 
proportion of the total population in the $ij$-th cell. Let $\bar{Y}$ 
denote the mean value of a survey characteristic $y$ for the 
population and $\bar{Y}_{ij}$ denote the mean value of $y$ for the $ij$-th 
cell. 

The sample is selected as follows: 


i) Sample sizes $n_{ij}$ are randomly determined for each cell
according to a pre-specified procedure. Letting s denote
the R x C array $(n_{ij};\ i = 1, \ldots, R;\ j = 1, \ldots, C)$, this
procedure assigns a probability p(s) to each s in the set
S of possible such arrays and selects a single array, s,
from S. We denote the dependence of $n_{ij}$ on s by
writing $n_{ij}(s)$.

ii) A simple random sample of $n_{ij}(s)$ units is then selected
from the ij-th cell and the values of y obtained.


Restrict attention to designs of fixed sample size n > 0,
that is, restrict to arrays $s \in S_n$ such that
$\sum_{i=1}^{R} \sum_{j=1}^{C} n_{ij}(s) = n$. We would also like to restrict
attention to proportionate stratification so that


$$E(n_{ij}) = \sum_{s \in S_n} n_{ij}(s)\,p(s) = nP_{ij} \quad \text{for } i = 1, \ldots, R;\ j = 1, \ldots, C, \qquad (1)$$


which implies that the simple unweighted sample mean
$\bar{y}(s)$ is an unbiased estimator of $\bar{Y}$. We will refer to (1) as
the expected proportional allocation (EPA) constraint. 
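
As a small illustration of the EPA constraint (1), the following Python sketch computes $E(n_{ij}) = \sum_s n_{ij}(s)\,p(s)$ for a design and compares it with $nP_{ij}$. The function name `expected_counts` and the 2 x 2 toy design are assumptions made purely for illustration; they are not taken from the examples in this paper.

```python
from fractions import Fraction

def expected_counts(design):
    """Return E(n_ij) = sum_s n_ij(s) p(s) for a design given as a list
    of (array, probability) pairs, where each array is a list of rows."""
    first, _ = design[0]
    R, C = len(first), len(first[0])
    exp = [[Fraction(0)] * C for _ in range(R)]
    for s, p in design:
        for i in range(R):
            for j in range(C):
                exp[i][j] += s[i][j] * p
    return exp

# Hypothetical 2 x 2 population with P_ij = 1/4 in every cell and n = 2.
half = Fraction(1, 2)
design = [([[1, 0], [0, 1]], half), ([[0, 1], [1, 0]], half)]
target = [[Fraction(1, 2)] * 2 for _ in range(2)]   # n * P_ij

epa_holds = expected_counts(design) == target       # True for this toy design
```

Exact fractions are used so that the comparison with $nP_{ij}$ is not affected by floating-point rounding.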

The linear programming technique of Sitter and Skinner 
(1994) chooses a sampling design p(s) which minimizes the 
expected lack of ‘desirability’ of the samples by solving the 
linear programming problem: 


$$\min_{p(s)} \sum_{s \in S_n} w(s)\,p(s) \qquad (2)$$

subject to the constraint (1), where w(s) is a loss function 
for the sample s, to be specified, and the p(s) are the 
unknowns. Sitter and Skinner (1994) were exploiting the 
key observation of Rao and Nigam (1990, 1992) in the 
context of avoiding undesirable samples, that the objective 
function in (2) was linear in the p(s)’s (see also Lahiri and 
Mukerjee 2000). 

In the objective function (2), the loss function w(s) plays 
an important role. With a well defined w(s), we have flexi- 
bility to explore the existence of an optimal solution to (2) 
within an economically sized S and, more importantly, to 
improve efficiency of estimation. Sitter and Skinner (1994) 
suggest choosing 

$$w(s) = \sum_{i=1}^{R} (n_{i\cdot}(s) - nP_{i\cdot})^2 + \sum_{j=1}^{C} (n_{\cdot j}(s) - nP_{\cdot j})^2 \qquad (3)$$

where $n_{i\cdot}(s) = \sum_j n_{ij}(s)$, $n_{\cdot j}(s) = \sum_i n_{ij}(s)$, $P_{i\cdot} = \sum_j P_{ij}$
and $P_{\cdot j} = \sum_i P_{ij}$. Obviously, the objective function (2) is
actually E(w(s)) for any given design p(s) and can be
explained as the mean squared error of the estimator $\bar{y}$ under
an analysis of variance model (see Sitter and Skinner 1994).
Then by solving the above linear programming problem,
one can obtain minimized MSE in the sense of ANOVA
while maintaining the EPA property of the $n_{ij}(s)$. One
should note that if a design with objective function equal to 
zero is obtained, then all margin constraints are met. This 
would typically only be the case with integer margins. 
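
The loss function (3) can be sketched in a few lines of Python. The helper names `margins` and `loss`, and the two sample arrays below, are hypothetical and chosen only for illustration; the expected counts are those of Table 1.

```python
from fractions import Fraction

def margins(a):
    """Row totals and column totals of a rectangular array."""
    return [sum(r) for r in a], [sum(c) for c in zip(*a)]

def loss(s, expected):
    """Loss (3): squared distance between the margins of the sample
    array s and the margins of the expected counts n * P_ij."""
    s_rows, s_cols = margins(s)
    e_rows, e_cols = margins(expected)
    return (sum((x - y) ** 2 for x, y in zip(s_rows, e_rows))
            + sum((x - y) ** 2 for x, y in zip(s_cols, e_cols)))

# Expected counts n * P_ij from Table 1, as exact fractions.
table1 = [[Fraction(x) for x in row.split()] for row in (
    "1.0 0.5 0.5", "0.2 0.3 0.5", "0.2 0.6 1.2", "0.6 1.8 0.6", "1.0 0.8 0.2")]

# Two hypothetical samples of size 10: the first matches every margin of
# Table 1, the second moves one unit from Metropolitan to Rural in region 1.
balanced = [[1, 0, 1], [0, 0, 1], [0, 1, 1], [1, 2, 0], [1, 1, 0]]
unbalanced = [[1, 1, 0], [0, 0, 1], [0, 1, 1], [1, 2, 0], [1, 1, 0]]

w_balanced = loss(balanced, table1)      # all margins matched, so w(s) = 0
w_unbalanced = loss(unbalanced, table1)  # two column totals are off by one
```

With integer margins, a sample has zero loss exactly when all of its row and column totals agree with the expected margins.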

Sitter and Skinner (1994) suggest that one simple way to
reduce the size of $S_n$ is to restrict the actual values that $n_{ij}$
can take to be either $\lfloor nP_{ij} \rfloor$ or $\lfloor nP_{ij} \rfloor + 1$, where $\lfloor nP_{ij} \rfloor$
is the greatest integer less than or equal to $nP_{ij}$. By
denoting $\tilde{n}_{ij} = n_{ij} - \lfloor nP_{ij} \rfloor$ and $r_{ij} = nP_{ij} - \lfloor nP_{ij} \rfloor$, one can
then impose


$$E(\tilde{n}_{ij}) = r_{ij} \qquad (4)$$


where $\tilde{n}_{ij} = 0$ or 1 and $0 \le r_{ij} < 1$. Then the linear
programming method can be applied to the $\tilde{n}_{ij}$ and finally
$\lfloor nP_{ij} \rfloor + \tilde{n}_{ij}$ can be used as the actual cell sample sizes.
Therefore, without loss of generality, we will assume that


$$n_{ij} = 0, 1 \quad \text{and} \quad 0 \le r_{ij} = nP_{ij} < 1. \qquad (5)$$


2.2 Higher-way Stratification


The Sitter and Skinner (1994) approach extends straight- 
forwardly to more stratifying factors by letting s denote the 
corresponding r-way array. The loss function would then 
include more terms, for example for three-way stratification 
equation (3) could be replaced by 




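$$w(s) = \gamma_1 \sum_{i=1}^{R_1} (n_{i\cdot\cdot}(s) - nP_{i\cdot\cdot})^2 + \gamma_2 \sum_{j=1}^{R_2} (n_{\cdot j\cdot}(s) - nP_{\cdot j\cdot})^2 + \gamma_3 \sum_{k=1}^{R_3} (n_{\cdot\cdot k}(s) - nP_{\cdot\cdot k})^2$$


in obvious notation, where $\gamma_1$, $\gamma_2$ and $\gamma_3$ might represent the
relative importance of balancing on the three factors based
on prior information (see Sitter and Skinner 1994).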


2.3 Multi-stage Sampling


An important application of multi-way stratification is to 
the selection of primary sampling units (psu’s) in multi- 
stage sampling, where it is more common to have several 
stratifying factors available. 

In section 2.1, the inclusion probabilities of each unit are
$E(n_{ij}(s))/N_{ij} = n/N$. If psu’s are selected with equal
probability then the approach extends directly with the
psu’s as the units and with the observed values of y replaced
by unbiased estimators of the psu totals. However, if the
psu’s are to be selected with unequal probabilities, say
$n z_{ijk}$ for psu k in stratification cell ij ($z_{ijk}$ will typically
equal $M_{ijk} / \sum_{ijk} M_{ijk}$, with $M_{ijk}$ being some measure of
size of psu k in cell ij), then the procedure can be easily
modified by setting $P_{ij}$ equal to $z_{ij\cdot}/z_{\cdot\cdot\cdot}$, where $z_{ij\cdot} = \sum_k z_{ijk}$
and $z_{\cdot\cdot\cdot} = \sum_{ij} z_{ij\cdot}$. Then, if $n_{ij}(s) > 0$, a sample of $n_{ij}(s)$
psu’s in cell ij is selected by some probability proportional
to $z_{ijk}$ method.
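
The modification for unequal probability selection of psu’s amounts to computing the cell proportions from the size measures. A minimal Python sketch follows; the function name `cell_proportions` and the size measures in `sizes` are hypothetical and serve only to show the arithmetic.

```python
from fractions import Fraction

def cell_proportions(sizes):
    """sizes[i][j] is the list of size measures M_ijk of the psu's in
    cell ij. With z_ijk = M_ijk / (sum of all M_ijk), this returns
    P_ij = z_ij. / z..., which reduces to sum_k M_ijk / (sum of all M_ijk)."""
    total = sum(m for row in sizes for cell in row for m in cell)
    return [[Fraction(sum(cell), total) for cell in row] for row in sizes]

# Hypothetical size measures for the psu's in a 2 x 2 stratification.
sizes = [[[10, 30], [20]],
         [[5, 15, 20], [100]]]
P = cell_proportions(sizes)   # [[1/5, 1/10], [1/5, 1/2]]
```

The resulting $P_{ij}$ sum to one and can be used in place of $N_{ij}/N$ in the constraint (1).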


2.4 An Example 


The linear programming approach can be illustrated 
using the hypothetical example of Bryant et al. (1960) given 
in Table 1. First, this problem is simplified as shown in 
Table 2 to meet the assumption in (5). Then, a standard 
linear programming package is used to solve this reduced 
problem (2). Because the integer margins of the expected sample
cell counts can be matched exactly by the marginal totals of the
sample sizes, $n_{i\cdot}$ and $n_{\cdot j}$, the loss function
w(s) can achieve a minimum value of zero, and the objective
function in (2) for this example is also minimized at zero.
The optimal solution of this problem is given in Table 3. It 
should be noted that this solution has been converted back 
to match the original example shown in Table 1. 
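
The simplification from Table 1 to Table 2 is the split of each expected count into its integer part and its residual, as in (4) and (5). A minimal Python sketch, with the hypothetical helper name `reduce_counts`:

```python
from fractions import Fraction
from math import floor

def reduce_counts(expected):
    """Split each expected count nP_ij into floor(nP_ij) and the
    residual r_ij = nP_ij - floor(nP_ij)."""
    base = [[floor(x) for x in row] for row in expected]
    resid = [[x - floor(x) for x in row] for row in expected]
    return base, resid

# Expected counts from Table 1, as exact fractions.
table1 = [[Fraction(x) for x in row.split()] for row in (
    "1.0 0.5 0.5", "0.2 0.3 0.5", "0.2 0.6 1.2", "0.6 1.8 0.6", "1.0 0.8 0.2")]

base, table2 = reduce_counts(table1)
row_totals = [sum(r) for r in table2]        # margins of the reduced problem
col_totals = [sum(c) for c in zip(*table2)]
```

A solution of the reduced problem is converted back by adding `base[i][j]` to each selected 0/1 cell count.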


Table 2
Modified Example from Bryant et al. (1960)

                  Type of Community
Region    Urban    Rural    Metropolitan    Total

1          0.0      0.5         0.5          1.0
2          0.2      0.3         0.5          1.0
3          0.2      0.6         0.2          1.0
4          0.6      0.8         0.6          2.0
5          0.0      0.8         0.2          1.0
Total      1.0      3.0         2.0          6.0

Table 3
Linear Programming Solution to Example
from Bryant et al. (1960)


[Table body not legible in the source: six 5 x 3 sample arrays s, each listed with its selection probability p(s).]


The linear programming method is simple and easy to 
use. Its main drawback is computational. The number of
parameters in the resulting linear programming problem is
the number of samples of size n from the RC > n cells,
$\binom{RC}{n}$, which becomes infeasibly large quite quickly. In the
next section we will explore ways of improving the 
computational efficiency of the linear programming 
approach while maintaining all of its good properties. 
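
To give a feel for how quickly the number of unknowns grows, the count $\binom{RC}{n}$ can be evaluated directly. The sketch below assumes, as in (5), that every cell count is 0 or 1; the function name `n_unknowns` is hypothetical.

```python
from math import comb

def n_unknowns(R, C, n):
    """Number of 0/1 arrays with exactly n ones among R*C cells, i.e. the
    number of unknowns p(s) when each cell count is restricted to 0 or 1."""
    return comb(R * C, n)

small = n_unknowns(5, 3, 6)     # reduced Bryant et al. example (Table 2)
large = n_unknowns(10, 8, 40)   # a 10 x 8 array with n = 40
```

The reduced 5 x 3 example already has 5005 possible samples, and a 10 x 8 array with n = 40 has roughly $10^{23}$, far beyond what a linear programming package can take as unknowns.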


3. THE LINEAR PROGRAMMING APPROACH 
MADE PRACTICAL 


The basic idea of the linear programming approach is to 
obtain an optimal sampling design in terms of the 
(minimum) expected lack of “desirability” of the sample by 
directly solving a linear programming problem with p(s),
$s \in S_n$, as the unknowns while maintaining the EPA property.
The only obstacle to this approach is that the number
of elements in $S_n$ is often very large and even with modern
computing power it becomes difficult to carry out linear
programming if the number of unknowns is large.

To reduce the magnitude of the computational task for
this linear programming problem determined by the cardinality
of $S_n$, we want to obtain a subset of $S_n$, say $S_{n0}$,
which is nearly as representative as $S_n$ but much smaller,
and thus solve the following linear programming problem
with a much smaller set of p(s), $s \in S_{n0}$, as the unknowns:


$$\min_{p(s)} \sum_{s \in S_{n0}} w(s)\,p(s). \qquad (6)$$


Hopefully, in this way we can easily deal with larger 
practical problems without losing the good properties of the 
linear programming approach. 


3.1 Some Motivating Strategies 


The above strategy is easy to state, but it turns out not to 
be entirely obvious how to go about it. In fact, there are 
several different directions we can explore to determine 
such a subset $S_{n0} \subset S_n$. In this section, we will describe a
basic method related to loss functions which was alluded to
in Sitter and Skinner (1994) and describe how it modestly 
increases the size of problems that can be handled. We will 
then discuss some obvious directions to take which did not 
improve things much. By describing these misguided 
attempts, we motivate the eventual proposal. 

The major flexibility of the linear programming 
approach is derived from the choice of loss function w(s). 
Thus, it is natural for us to consider the loss function first 
when we try to improve the computational efficiency of this 
approach. By observing the objective function of the linear 
programming problem (2), we suspect that the loss function 
w(s) as coefficients of unknowns p(s) will not be very large 
when the objective function has been minimized. In other 
words, all positive p(s) in an optimal sampling design will 
only be assigned to samples having small lack of 
“desirability”. Based on this observation, we hypothesize 
that the following subset might be a good replacement for
$S_n$:


$$S_{n0} = \left\{ s \in S_n : w(s) = \sum_{i=1}^{R} (n_{i\cdot}(s) - nP_{i\cdot})^2 + \sum_{j=1}^{C} (n_{\cdot j}(s) - nP_{\cdot j})^2 \le w_0 \right\}, \qquad (7)$$


where $w_0$ is a pre-determined positive constant. In the case
of integer margins, one could even let $w_0 = 0$ and restrict to
samples where the margins are matched. For example, the
solution in Table 3 assigned positive probability to only 6 
samples and for each of these the objective function was 
zero. 
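
For a problem as small as Table 2, the subset in (7) with $w_0 = 0$ can be enumerated by brute force. The Python sketch below does so; the function name `margin_matched` is hypothetical, and excluding cells with $r_{ij} = 0$ is an added assumption (such cells cannot receive positive probability under the EPA constraint), not part of (7) itself.

```python
from itertools import combinations, product

def margin_matched(r, row_tot, col_tot):
    """Enumerate every 0/1 array whose row and column totals equal the
    given integer margins and which is zero wherever r_ij = 0."""
    C = len(col_tot)
    row_choices = []
    for i, k in enumerate(row_tot):
        allowed = [j for j in range(C) if r[i][j] > 0]
        row_choices.append(list(combinations(allowed, k)))
    found = []
    for rows in product(*row_choices):
        cols = [0] * C
        for picked in rows:
            for j in picked:
                cols[j] += 1
        if cols == list(col_tot):
            found.append([[1 if j in picked else 0 for j in range(C)]
                          for picked in rows])
    return found

# Residual expected counts r_ij from Table 2 and their integer margins.
table2 = [[0.0, 0.5, 0.5], [0.2, 0.3, 0.5], [0.2, 0.6, 0.2],
          [0.6, 0.8, 0.6], [0.0, 0.8, 0.2]]
candidates = margin_matched(table2, [1, 1, 1, 2, 1], [1, 3, 2])
```

Every array in `candidates` has loss zero, so a linear program over this set only has to find probabilities p(s) that satisfy the EPA constraint. The enumeration itself grows combinatorially and is not feasible for large arrays.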


Lu (2000) develops nested linear programming strategies 
for solving this problem. For moderately sized problems 
such as 8 x 5 arrays (i.e., 40 cells) this approach does well. 
However, for larger problems the size of resulting candidate 
sets becomes large very quickly, even in the integer margin 
case. Thus for large problems the technique faces the same
problem as before: a huge candidate set that results in the
difficulty of solving a linear programming problem with too
many unknowns.

In reality, even a candidate sample set $S_{n0}$ of the form in
(7) is far larger than necessary for us to find an optimal
solution. What we really need is a smaller but fairly 
representative subset, where by “small” we mean small 
enough to make it possible to solve the resulting linear pro- 
gramming problem and by “representative” we mean 
containing elements which promise that this linear 
programming problem is feasible. 


Before going on to describe our eventual proposed solu- 
tion to this problem, we would like to introduce some naive 
methods of obtaining such a “representative subset” that 
turned out not to work well. These are not that useful in 
practice, but they did inspire our thinking in proposing a 
more sophisticated approach. 


Lu and Sitter: Multi-way Stratification by Linear Programming Made Practical 


1) Two Stage Optimization: First of all, we could try to
break $S_{n0}$ in (7) into many subsets, each small enough
to be handled by linear programming separately. Hopefully,
optimal solutions from each of these smaller sets in
the first stage optimization procedure can be combined to 
form the desired representative set of samples. Then we can 
just collect these optimal solutions together and apply linear 
programming once more. We applied this method to some 
simulated examples of size 6 x 6, 7x7, 8x8 and 9x9 
as a method of preliminary investigation of its potential. 
Generally, in the first two cases the method worked very 
well and quickly, in the 8 x 8 case the method was time 
consuming and was not always able to obtain optimal 
solutions, and in the 9x9 case the method became 
infeasible. 


2) Resampling from $S_{n0}$: We could also randomly select
a proportion, say 10%, of the $S_{n0}$ in (7) and hope this
proportion is statistically representative of the complete set.
Unfortunately, simulation results showed that the proportion
obtained in this way is not “representative” enough,
and the resulting linear programming problem often does
not have any feasible solution. For example, the method of
nested linear programming discussed previously was able
to obtain matched integer margin solutions for simulated
8 x 5 arrays; however, these solutions were obtained much
more quickly by repeatedly sampling 10% of $S_{n0}$ and applying
the Sitter and Skinner (1994) method to this set until a
feasible solution was obtained. When slightly larger cases
were considered, however, the method took an inordinate
amount of time to find a feasible solution, and
quickly became impractical.

There are two problems with both these approaches.
First, the size of $S_{n0}$ grows combinatorially, and
even complete enumeration becomes difficult. Having to
first obtain $S_{n0}$ and then cut the problem into pieces
will either quickly outstrip the practical limits on linear
programming due to the size of the pieces or create a huge
number of pieces. Second, neither of these strategies
attempts in any way to avoid samples which are particularly
bad choices for meeting the EPA constraints. The
question is whether there is any way to generate a fairly
“representative” candidate sample subset without choosing
such “useless” samples or, more generally, whether we can select
candidate samples in which the frequency of an entry’s
appearance is more or less related to its desired expected
sample count, and whether we can do so without first having
to enumerate a large $S_{n0}$. The general idea revolves around
the fact that if we could randomly select a candidate subset
directly from $S_n$ without complete enumeration, using an
unequal probability selection procedure which simultaneously
ensures that the objective function is minimized for
every sample and that the EPA property is
satisfied, we would have solved the problem without resorting
to linear programming at all. We have been working on
finding such a selection procedure, but have yet to succeed.
What we have been able to do is to develop such a procedure
with approximate EPA (AEPA). We can then use it to
randomly generate a candidate subset of samples, $S_{n0}$, and
then apply a linear programming technique to this subset.


3.2 A Sampling Procedure with AEPA Property 


In this section we first describe the approach as it applies 
to the case of integer margins. That is, the column totals,
$n_{\cdot j} = \sum_{i=1}^{R} r_{ij}$, and the row totals, $n_{i\cdot} = \sum_{j=1}^{C} r_{ij}$, are integer
valued. We go on to discuss how it can easily be adapted to
the general case. In the linear programming approach, the 
goal is to minimize the expected lack of ‘desirability’ of the 
samples while maintaining the EPA property. We propose 
to accomplish this in two stages. First, we will develop an 
unequal probability selection procedure which selects 
samples which exactly match the integer margins and also 
have the AEPA property. We will then randomly generate 
a moderately sized set of such arrays and then apply a 
modified linear programming technique to this subset of all 
possible arrays. This will be repeated with larger and larger 
such sets. We will describe the sampling procedure and 
then we will discuss the modified linear programming 
technique. 

Here is the basic idea for constructing such a sampling 
procedure: for a two-way table (assuming the expected cell 
sample sizes have been adjusted to lie between 0 and 1 as 
was done in going from Table 1 to 2), first we draw a
sequence of population cells to produce $a_{11}, a_{12}, \ldots, a_{1C}$ in
the first row using an unequal probability without replacement
sampling procedure based on the expected counts of
that row, where $a_{ij} = 1$ if the ij-th cell is selected and $a_{ij} = 0$
otherwise. Then we draw $a_{i1}, a_{i2}, \ldots, a_{iC}$ subsequently for i > 1
while keeping all $\sum_{i' \le i} a_{i'j}$ less than or equal to the
corresponding marginal column totals $n_{\cdot j}$. The details of
this sampling procedure are as follows:


Step 1: Randomly permute the rows and let i = 1. Given the
first row of inclusion probabilities $r_{11}, r_{12}, \ldots, r_{1C}$, draw a
sample of $n_{1\cdot}$ cells out of C in the first row stratum using an
unequal probability without replacement sampling procedure;
record the first row of samples in terms of indicator
variables $a_{11}, a_{12}, \ldots, a_{1C}$ as defined previously; let $A_j = a_{1j}$
for j = 1, ..., C.

Step 2: Let i = i + 1.

Step 2.1: For j = 1, ..., C, do the following:

a) Let $R_j = \sum_{i'=1}^{i} r_{i'j}$.

b) If $R_j - A_j \le 0$, let $a_{ij} = 0$.

c) If $R_j - A_j \ge 1$, let $a_{ij} = 1$.

Step 2.2: Let $J = \{ j : 0 < R_j - A_j < 1 \}$ and
rtot $= n_{i\cdot} - \sum_{j \notin J} a_{ij}$. Rescale $r_{ij}$ to
$r'_{ij} = r_{ij} \times \text{rtot} / \sum_{j \in J} r_{ij}$ for $j \in J$. If there exists a $j_0 \in J$ such
that $r'_{ij_0} > 1$ then let $a_{ij_0} = 1$ and go to Step 2.1. Otherwise
go to Step 3.

Step 3: Draw a sample of rtot cells from J using an unequal
probability without replacement sampling procedure and
$r'_{ij}$ to get $a_{ij}$ for $j \in J$. Let $A_j = A_j + a_{ij}$ for j = 1, ..., C.

Step 4: If i = R, then stop; otherwise go to Step 2.


One aspect of this sampling procedure that should be
noticed is that in Step 2, the way of re-calculating the i-th
row of inclusion probabilities, $r'_{ij}$, is not unique. However, the
general rules that should be followed for this re-calculation
are:


(a) $0 \le r'_{ij} \le 1$, and if $A_j = n_{\cdot j}$, which means that
enough units have already been selected from the j-th column, $r'_{ij}$
should be set to 0; if $A_j = n_{\cdot j} - (R - i + 1)$, which
means that there will not be enough units selected
for this column unless all of the remaining units are
selected, $r'_{ij}$ should be set to 1;

(b) keep $\sum_{j=1}^{C} r'_{ij} = \sum_{j=1}^{C} r_{ij} = n_{i\cdot}$.

The method extends easily to non-integer margins. We 
delay detailed discussion, however, to the sequel. 
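
The row-by-row idea can be sketched in Python as follows. This is a simplified variant that follows only the general rules (a) and (b) above: a cell is forced to 0 once its column total is reached, forced to 1 when every remaining row is needed, and the rest of each row is drawn with probability roughly proportional to $r_{ij}$. It is not the exact Steps 1 to 4, makes no claim about the AEPA property, and restarts from scratch whenever a row cannot be completed. The function names, the use of successive weighted draws as the unequal probability method, and the `max_tries` limit are all assumptions made for illustration.

```python
import random

def weighted_draw(weights, k, rng):
    """Draw k distinct keys one at a time, each with probability
    proportional to its weight among the keys not yet chosen."""
    pool = dict(weights)
    chosen = []
    for _ in range(k):
        keys = list(pool)
        pick = rng.choices(keys, weights=[pool[j] for j in keys])[0]
        chosen.append(pick)
        del pool[pick]
    return chosen

def draw_array(r, row_tot, col_tot, rng, max_tries=1000):
    """Return one 0/1 array with the given integer margins, filled row by
    row; restart whenever a row cannot be completed."""
    R, C = len(r), len(r[0])
    for _ in range(max_tries):
        order = list(range(R))
        rng.shuffle(order)               # random row order
        a = [[0] * C for _ in range(R)]
        need = list(col_tot)             # n_.j - A_j for each column
        for pos, i in enumerate(order):
            rows_left = R - pos          # rows not yet filled, including i
            forced = [j for j in range(C) if need[j] == rows_left]
            free = {j: r[i][j] for j in range(C)
                    if 0 < need[j] < rows_left and r[i][j] > 0}
            k = row_tot[i] - len(forced)
            if any(r[i][j] == 0 for j in forced) or not 0 <= k <= len(free):
                break                    # dead end: try again
            for j in forced + weighted_draw(free, k, rng):
                a[i][j] = 1
                need[j] -= 1
        else:
            if not any(need):            # every column total is matched
                return a
    raise RuntimeError("no margin-matching array found")

# Residual expected counts from Table 2 with margins (1,1,1,2,1) and (1,3,2).
table2 = [[0.0, 0.5, 0.5], [0.2, 0.3, 0.5], [0.2, 0.6, 0.2],
          [0.6, 0.8, 0.6], [0.0, 0.8, 0.2]]
sample = draw_array(table2, [1, 1, 1, 2, 1], [1, 3, 2], random.Random(1))
```

Repeated calls with the same random generator give a candidate set whose members all match the margins exactly, which is the property the linear programming step relies on.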

We can now use the above method to generate a
candidate set, $S_{n0}$, and apply the linear programming
technique to this set. To see why we choose to modify the
linear programming technique, realize that for the integer
margin case every $s \in S_{n0}$ already attains the minimum in
(2) so that a direct application of linear programming
amounts to determining whether there is a feasible solution
or not. Thus, if we generate, say, an $S_{n0}$ of size 500, then
1,000, etc., and the linear programming package continues to
find no feasible solution, we really do not know if we are
getting closer to a solution or not. Instead we choose to turn
the optimization around and solve a dual problem


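$$\min_{p(s)} \sum_{i,j} \left| \sum_{s \in S_{n0}} n_{ij}(s)\,p(s) - r_{ij} \right|. \qquad (8)$$


We know that w(s) = 0 for all $s \in S_{n0}$ and we are looking
for a solution which yields a minimum of zero in (8). We
have essentially switched the roles of the objective function
and the EPA constraints in the original problem. The difficulty
is that (8) involves absolute values, which makes it harder
to handle by linear programming. This can be done as follows. Set up constraints

$$\sum_{s \in S_{n0}} n_{ij}(s)\,p(s) - r_{ij} + d_{ij} - e_{ij} = 0 \quad \text{for } i = 1, \ldots, R \text{ and } j = 1, \ldots, C, \qquad (9)$$

where

$$d_{ij} \ge 0, \quad e_{ij} \ge 0, \quad d_{ij} e_{ij} = 0. \qquad (10)$$

Then note that

$$\left| \sum_{s \in S_{n0}} n_{ij}(s)\,p(s) - r_{ij} \right| = \begin{cases} d_{ij} & \text{if } \sum_{s \in S_{n0}} n_{ij}(s)\,p(s) - r_{ij} < 0 \\ e_{ij} & \text{if } \sum_{s \in S_{n0}} n_{ij}(s)\,p(s) - r_{ij} \ge 0 \end{cases} \; = d_{ij} + e_{ij}. \qquad (11)$$

Thus, we can replace (8) by

$$\min_{p(s),\, d_{ij},\, e_{ij}} \sum_{i,j} (d_{ij} + e_{ij}), \qquad (12)$$

subject to

$$\sum_{s \in S_{n0}} n_{ij}(s)\,p(s) - r_{ij} + d_{ij} - e_{ij} = 0, \quad d_{ij} \ge 0, \quad e_{ij} \ge 0, \quad p(s) \ge 0, \quad d_{ij} e_{ij} = 0. \qquad (13)$$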
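
The role of the auxiliary variables in (9) to (11) can be checked numerically. The Python sketch below does not solve the linear program; for a given candidate set and given probabilities it only computes the $d_{ij}$ and $e_{ij}$ that satisfy (9) and (10) and evaluates the objective (12). The function names and the 2 x 2 toy candidate set are hypothetical.

```python
def deviation_parts(candidates, p, r):
    """Return arrays d and e such that, for every cell,
    sum_s n_ij(s) p(s) - r_ij + d_ij - e_ij = 0 and d_ij * e_ij = 0."""
    R, C = len(r), len(r[0])
    d = [[0.0] * C for _ in range(R)]
    e = [[0.0] * C for _ in range(R)]
    for i in range(R):
        for j in range(C):
            gap = sum(s[i][j] * ps for s, ps in zip(candidates, p)) - r[i][j]
            if gap >= 0:
                e[i][j] = gap      # expected count too large
            else:
                d[i][j] = -gap     # expected count too small
    return d, e

def objective(d, e):
    """Objective (12): the sum of d_ij + e_ij over all cells."""
    return sum(map(sum, d)) + sum(map(sum, e))

# Hypothetical candidate set of two arrays with r_ij = 0.5 in every cell.
cands = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
r = [[0.5, 0.5], [0.5, 0.5]]

exact = objective(*deviation_parts(cands, [0.5, 0.5], r))    # EPA holds
skewed = objective(*deviation_parts(cands, [0.75, 0.25], r)) # each cell off by 0.25
```

An objective of zero means that the probabilities reproduce every $r_{ij}$ exactly, which is the stopping criterion used in the examples that follow.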


3.3 Some Illustrating Examples with Integer 
Margins 


In this section, two examples will be used to illustrate the 
sampling procedure. The first with a 10x8 array is 
described in detail to show the whole procedure. The 
second with a larger size (20 x 15) is given to demonstrate 
the size of problem that this method can handle (this is near 
the limit of the problem the proposed method can realisti- 
cally handle). Any unequal probability without replacement 
sampling procedure can be used within the method. In 
Example 1 below, we chose to use the random grouping
method of Rao, Hartley and Cochran (1962), since it is 
simple and we really only need to approximately match the 
selection probabilities, which it does. However, the Rao- 
Hartley-Cochran method only works well up to problems of 
moderate size. In Examples 2 and 3 one should use a 
method which exactly matches the selection probabilities. 
There are many such available, but we chose to use one 
developed in Lu (2000). 


Example 1. 10 x 8 array with integer margins: A two-way
stratification problem, with its expected sample cell counts
and sample size, is given in Table 4.


Table 4
Expected Sample Cell Counts Under Proportionate
Stratification (n = 40)


                            Column No.                        Marginal
Row No.       1     2     3     4     5     6     7     8    Row Total

 1          0.41  0.55  0.58  0.80  0.23  0.61  0.70  0.12       4
 2          0.52  0.15  0.07  0.90  0.28  0.10  0.37  0.61       3
 3          0.72  0.15  0.65  0.73  0.39  0.34  0.85  0.17       4
 4          0.70  0.55  0.46  0.10  0.41  0.05  0.24  0.49       3
 5          0.07  0.63  0.45  0.81  0.52  0.02  0.70  0.80       4
 6          0.61  0.33  0.79  0.21  0.02  0.61  0.67  0.76       4
 7          0.88  0.48  0.73  0.69  0.44  0.64  0.86  0.28       5
 8          0.22  0.14  0.85  0.37  0.69  0.45  0.49  0.79       4
 9          0.85  0.44  0.80  0.76  0.31  0.71  0.60  0.53       5
10          0.02  0.58  0.62  0.63  0.71  0.47  0.52  0.45       4

Marginal
Col Total     5     4     6     6     4     4     6     5       40


The basic steps of our sampling design are illustrated as 
follows: 


Step 1. Obtain a representative candidate sample subset
$S_{n0}$ by using the proposed sampling procedure with the AEPA
property to draw, say, 500 samples (obtained within 3
minutes). The sample proportion of each cell is shown in
Table 5, which can be compared to Table 4 to see how
close these are to satisfying the EPA property.


Step 2. Solve the linear programming problem given by
(12) and (13) to obtain

$$\min_{p(s)} \sum_{i,j} \left| \sum_{s \in S_{n0}} n_{ij}(s)\,p(s) - r_{ij} \right|. \qquad (14)$$

If the objective value of (14) is greater than zero, repeat
Step 1 with a larger set $S_{n0}$. If the objective value of (14) is
zero, stop; an optimal solution has been obtained.


Table 5
Sample Cell Counts Under Prop. Stratification (n = 40)

                                Column No.                            Marginal
Row No.       1      2      3      4      5      6      7      8     Row Total

 1          0.408  0.554  0.582  0.776  0.250  0.594  0.734  0.102       4
 2          0.554  0.150  0.062  0.916  0.280  0.122  0.366  0.550       3
 3          0.690  0.144  0.638  0.720  0.402  0.360  0.838  0.208       4
 4          0.692  0.542  0.452  0.120  0.416  0.044  0.260  0.474       3
 5          0.060  0.602  0.446  0.814  0.568  0.016  0.708  0.786       4
 6          0.558  0.348  0.780  0.216  0.012  0.634  0.682  0.770       4
 7          0.866  0.480  0.734  0.676  0.470  0.664  0.842  0.268       5
 8          0.254  0.158  0.848  0.400  0.654  0.412  0.490  0.784       4
 9          0.870  0.418  0.830  0.772  0.292  0.692  0.624  0.502       5
10          0.026  0.564  0.636  0.658  0.714  0.416  0.500  0.486       4

Marginal: 6 Heals or 6) Daneiin eae SVes 40 


Col Total 


In this example, a candidate subset $S_{n0}$ of 500 samples
was sufficient to obtain an objective value of 0.


Example 2. 20 x 15 array with integer margins: In this 
example, a 20 x 15 array with integer margins is given in 
Table 6. 

The actual computation steps are given as follows: 


First Iteration:

Step 1. Draw 500 samples to form $S_{n0}$.

Step 2. The objective value of (14) is 0.1659.


Second Iteration:

Step 1. Draw 500 samples to add to $S_{n0}$.

Step 2. The objective value of (14) is 0. The final
sampling design is attained.


This procedure took approximately 30-60 seconds using 
a Fortran program on a Sun Ultra 10 workstation. 


3.4 Extension to Non-Integer Margins 


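The method extends easily to non-integer margins.
Merely replace $n_{i\cdot}$ throughout the algorithm by $n_{i\cdot}(s)$, which
takes value $\lfloor r_{i\cdot} \rfloor + 1$ with probability $\alpha = r_{i\cdot} - \lfloor r_{i\cdot} \rfloor$ and
takes value $\lfloor r_{i\cdot} \rfloor$ with probability $1 - \alpha$, where
$r_{i\cdot} = \sum_j r_{ij}$ and $r_{\cdot j} = \sum_i r_{ij}$ denote the expected row and
column totals. The only additional
difficulty is that E[w(s)] cannot attain zero. Thus,
we do not have an obvious lower-bound reference point to
ascertain whether we are close to the best solution or not.
However, the above randomization strategy ensures that for
every obtained AEPA sample we have


$$|n_{i\cdot}(s) - r_{i\cdot}| < 1 \quad \text{and} \quad |n_{\cdot j}(s) - r_{\cdot j}| < 1 \quad \text{for } i = 1, \ldots, R;\ j = 1, \ldots, C. \qquad (15)$$


This together with the EPA property, $E[n_{ij}(s)] =
\sum_s n_{ij}(s)\,p(s) = r_{ij}$, implies that the lack of desirability
function w(s) defined in (3) has a constant expectation


$$E[w(s)] = \sum_i (r_{i\cdot} - \lfloor r_{i\cdot} \rfloor)(1 + \lfloor r_{i\cdot} \rfloor - r_{i\cdot}) + \sum_j (r_{\cdot j} - \lfloor r_{\cdot j} \rfloor)(1 + \lfloor r_{\cdot j} \rfloor - r_{\cdot j}). \qquad (16)$$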


The proof of this is given in Appendix 1. Thus, if (14) 
attains zero under the above strategy then the resulting 
solution will yield minimum E[w(s) ] as in (16). 
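
The constant in (16) depends only on the fractional parts of the expected margins and is easy to evaluate. In the Python sketch below, the function name `min_expected_loss` and the second set of margins are hypothetical; the first call uses the integer margins of Table 1.

```python
from math import floor

def min_expected_loss(row_totals, col_totals):
    """Evaluate (16): the sum over all expected margins r of
    (r - floor(r)) * (1 + floor(r) - r)."""
    def term(r):
        frac = r - floor(r)
        return frac * (1 - frac)
    return (sum(term(r) for r in row_totals)
            + sum(term(r) for r in col_totals))

integer_case = min_expected_loss([2, 1, 2, 3, 2], [3, 4, 3])   # Table 1 margins
fractional_case = min_expected_loss([1.5, 2.5], [1.25, 2.75])  # hypothetical margins
```

Integer margins contribute nothing, so the first value is zero; in the second case the four margins contribute 0.25 + 0.25 + 0.1875 + 0.1875 = 0.875.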


Example 3. 27x3 real example with non-integer 
margins: We will illustrate the method using a real 
example from environmental health (Vihma 1981). This 
study was concerned with occupational health of workers in 
various industries in Finland. The population chosen for 
study consisted of 1,430 small industrial workplaces (5-49 
employees) totalling 22,893 employees in Uusimaa, the 
southernmost and most industrialized province of Finland.
The primary sampling units were the workplaces, and a
sample of n = 100 workplaces was desired. This was all that could
be afforded given the cost of the eventual survey. The
workplaces were stratified by two stratification variables:
type of industry (27 categories) and number of employees
(3 categories). The expected sample cell counts under
proportionate stratification are given in Table 7. The actual
sampling scheme used in this study was based on the
method of Bryant et al. (1960) after some grouping of strata, as
it was the only method available at the time of this study.

We applied our method to this problem. The minimum 
achievable E[w(s)] using our proposed strategy is 5.0418.
The actual computation steps were as follows: 


First Iteration:

Step 1. Draw 500 samples to form $S_{n0}$, randomly
generating the $n_{i\cdot}(s)$ independently for each
sample.

Step 2. The objective value of (14) is 0.45088.


Second Iteration:

Step 1. Draw 500 samples to add to $S_{n0}$.

Step 2. The objective value of (14) is 0. The final
sampling design is attained and achieved the
minimum value E[w(s)] = 5.0418.


This procedure took approximately 30 seconds using a 
Fortran program on a Sun Ultra 10 workstation. 


Table 6
Expected Sample Cell Counts Under Proportionate
Stratification (n = 151)


[Body of Table 6, a 20 x 15 array of expected sample cell counts, is not legible in the source.]

Table 7
Occupational Health Survey, Vihma (1981). Expected Sample
Cell Counts Under Proportionate Stratification (n = 100)


Type of Industry Number of Personnel 
Sas teehee OREO at 

Food products CO BO td. 29.2 
Food O35. a Ot toe 56. 105 
Beverage 0144. 0,07. 20.21. 0:42 
Textiles 133 oo Le 26u. lh6 9 214.05 
Apparel I Se I Be 8 A, BR 
Leather 0.56 0.14 0.07 0.77 
Footwear 0.07. 0,07 “602 dent O35 
Wood Products PRS Pas 0 Es aR ey 
Furniture 133 084 0.91 3.08 
Paper Products 0.42 049 0.42 1.33 
Printing 1.20 ,6.0F 9-420" (17:4% 
Industrial Chemicals OSG 3509 0:28 ok 19 
Chemical Products 1.825. 545 al:53. 0 4.89 
Petrolium 0.14 0.079 30.00% £021 
Misc Coal and Petrol. 0.07 0.07 0.14 0.28 
Rubber Products 0.14 0.21 0.07 0.42 
Plastic Products P40" Se 105— TIltD s 3.64 
Glass Products OA OR ek. (O84 
Non-Metal Minerals 112 0.98 0.84 2.94 
Iron & Steel 0.14 0.07 O35 0.56 
Nonferrous Metal OS0>= Ue = O28 0/7 
Fabricated Metal 4.96 406 2.59 11.61 
Machinery 2.80) 196" SS 1.97 
Electrical Eso" 1.60" “1.33 4:82 
Transport Equipment 0.84 0.84 0.84 2.52 
Scientific Equipment O56 28042 oe: A9 i “1 47 
Manufacturing Industries L.6594/0.99 O98Y | 3.57 
1. 38.19 32.66 29.15 100.00 




5. CONCLUDING REMARKS 


We propose a method for two-way stratification which 
extends the applicability of the linear programming 
approach of Sitter and Skinner (1994) to much larger 
problems. The method focuses on how to construct a small 
“representative” candidate sample set by using an unequal 
probability sampling procedure which generates candidate 
samples which nearly meet the AEPA constraints of the 
linear programming problem and then applying the linear 
programming method to this much smaller set. 

It should be noted that the linear programming method 
extends easily to stratified multi-stage designs. Since there 
is no fundamental difference between the original linear 
programming approach and the extension proposed here, 
this is still true of the proposed method. In the same spirit,
the discussion of variance estimation for the resulting
estimators in Sitter and Skinner (1994) applies here
as well.

One should also note that once one restricts to bracketing
integers around the $nP_{ij}$’s, the problem is related to a
controlled rounding problem (see Kelly, Golden and Assad
1993, and references therein), though we do not explore this
aspect here.
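

APPENDIX 1


Proof of (16): $n_{i\cdot}(s) - \lfloor r_{i\cdot} \rfloor \sim \text{Bernoulli}(r_{i\cdot} - \lfloor r_{i\cdot} \rfloor)$ and
has variance $(r_{i\cdot} - \lfloor r_{i\cdot} \rfloor)(1 + \lfloor r_{i\cdot} \rfloor - r_{i\cdot})$. This implies


$$\sum_s (n_{i\cdot}(s) - r_{i\cdot})^2\, p(s) = E(n_{i\cdot}(s) - r_{i\cdot})^2 = V(n_{i\cdot}(s)) = V(n_{i\cdot}(s) - \lfloor r_{i\cdot} \rfloor) = (r_{i\cdot} - \lfloor r_{i\cdot} \rfloor)(1 + \lfloor r_{i\cdot} \rfloor - r_{i\cdot})$$


and by a similar argument that
$\sum_s (n_{\cdot j}(s) - r_{\cdot j})^2\, p(s) = (r_{\cdot j} - \lfloor r_{\cdot j} \rfloor)(1 + \lfloor r_{\cdot j} \rfloor - r_{\cdot j})$.
Therefore, with w(s) defined in (3),


$$E[w(s)] = \sum_s w(s)\,p(s) = \sum_s \left\{ \sum_i (n_{i\cdot}(s) - r_{i\cdot})^2 + \sum_j (n_{\cdot j}(s) - r_{\cdot j})^2 \right\} p(s)$$

$$= \sum_i \sum_s (n_{i\cdot}(s) - r_{i\cdot})^2\, p(s) + \sum_j \sum_s (n_{\cdot j}(s) - r_{\cdot j})^2\, p(s)$$

$$= \sum_i (r_{i\cdot} - \lfloor r_{i\cdot} \rfloor)(1 + \lfloor r_{i\cdot} \rfloor - r_{i\cdot}) + \sum_j (r_{\cdot j} - \lfloor r_{\cdot j} \rfloor)(1 + \lfloor r_{\cdot j} \rfloor - r_{\cdot j}).$$

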
On the Use of Generalized Inverse Matrices in Sampling Theory


ROBBERT H. RENSSEN and GERARD H. MARTINUS’ 


ABSTRACT 


In theory, it is customary to define general regression estimators in terms of full-rank weighting models, i.e., the design
matrix that corresponds to the weighting model is of full rank. For such weighting models, it is well known that the general 
regression weights reproduce the (known) population totals of the auxiliary variables involved. In practice, however, the 
weighting model often is not of full rank, especially when the weighting model is for incomplete post-stratification. By 
means of the theory of generalized inverse matrices, it is shown under which circumstances this consistency property 
remains valid. As a non-trivial example we discuss the consistent weighting between persons and households as proposed 
by Lemaitre and Dufour (1987). We then show how the theory is implemented in Bascula. 


KEY WORDS: Bascula; General regression estimator; Weighting. 


1. INTRODUCTION 


Weighting methods that are based on the general 
regression estimator are commonly used in sample surveys 
to adjust for both sampling error and non-sampling error, 
see e.g. Bethlehem and Keller (1987) and Särndal,
Swensson, and Wretman (1992). One complication in the 
use of general regression estimators, however, is that many 
weighting models are based on incomplete post-stratifica- 
tion, resulting in design matrices that are not of full rank. 
Usually, this problem is solved by using a reduced design 
matrix. Such a reduced design matrix can be constructed by 
deleting redundant columns and properly adjusting the 
population totals. Often, the redundancy can be recognized 
rather easily beforehand by the specification of the 
weighting model. However, for some weighting models 
such a redundancy check may be impractical. 

For example, suppose that we have a post-stratification 
based on the complete crossing between two categorical 
variables A and B, with known counts for the population of 
each cell. We may obtain small sample counts or no sample 
in some cells. Then we may derive new classifications, A' 
from A and B' from B, by merging categories, and define 
the following more parsimonious scheme: A + B + A' x B'. 
According to this incomplete post-stratification we simultaneously 
calibrate on three sets of counts, namely the marginal 
counts of A, the marginal counts of B, and the cell 
counts of A' x B'. Since A and A' (and also B and B') 
appear in different weighting terms, it is difficult to reco- 
gnize redundancy by the specification of the weighting 
model. This paper gives the theoretical background, which 
is based on generalized inverse matrices, of reducing such 
a design matrix. 
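
The redundancy in a scheme such as A + B + A' x B' can be made visible numerically. The following is a minimal sketch, assuming a hypothetical sample in which A has four categories, B has three, and A' and B' are obtained by merging categories into two groups each; the helper `dummies` and all category counts are illustrative assumptions, not part of the paper.

```python
import numpy as np

def dummies(codes, n_levels):
    """Indicator matrix with one column per category."""
    out = np.zeros((len(codes), n_levels))
    out[np.arange(len(codes)), codes] = 1.0
    return out

# Hypothetical sample: every combination of A (4 categories) and
# B (3 categories), each observed twice.
a = np.tile(np.repeat(np.arange(4), 3), 2)
b = np.tile(np.tile(np.arange(3), 4), 2)
a_merged = a // 2               # A': {0,1} -> 0, {2,3} -> 1
b_merged = np.minimum(b, 1)     # B': {0} -> 0, {1,2} -> 1

# Design matrix for A + B + A' x B': 4 + 3 + 4 columns.
X = np.hstack([
    dummies(a, 4),
    dummies(b, 3),
    dummies(2 * a_merged + b_merged, 4),
])
n_columns = X.shape[1]
rank = np.linalg.matrix_rank(X)
print(n_columns, rank)
```

In this sketch the design matrix has 11 columns but only rank 7, and the redundant columns cannot be read off from the names of the weighting terms alone.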

In section 2 we briefly describe some properties of 
generalized inverse matrices. In section 3 we define the 
general regression estimator for weighting models that need 
not be of full rank. Given a regularity condition that can be 
nicely interpreted in a calibration estimation context (see 
Deville and Särndal 1992) it is shown that this estimator is 
invariant with respect to the choice of the generalized 
inverse. At the end of section 3 the fulfillment of this regu- 
larity condition is discussed for some well-known 
weighting models, such as incomplete post-stratification 
and consistent weighting between persons and households. 
In section 4 we describe the algorithm, which is imple- 
mented in Bascula (see Nieuwenbroek 1997; Renssen, 
Nieuwenbroek and Slootbeek 1997) for calculating the 
regression weights. Finally, in section 5 we briefly discuss 
the weighting model of the Dutch Labour Force Survey. 


2. GENERALIZED INVERSE MATRICES 


We are mainly interested in the use of generalized 
inverses within the framework of the general regression 
estimator. Hence, we only give some properties of a generalized 
inverse of a matrix of the form $X' \Lambda X$, where $\Lambda$ is a diagonal 
matrix of order n x n with strictly positive diagonal entries 
and X a design matrix of order n x p that results from the 
weighting model. For a more extensive discussion on 
generalized inverse matrices we refer to Searle (1971) and 
Rao (1973). 

Before giving these properties, we briefly review the 
definition of a generalized inverse. Consider a p x q matrix 
A of any rank and let $Ax = y$ be a system of consistent 
equations, i.e., any linear relationship existing among the 
rows of A also exists among the corresponding elements of 
y. A generalized inverse of A is a q x p matrix $A^-$ such that 
$x = A^- y$ is a solution of this system of equations. It is easy 
to verify that the existence of $A^-$ implies $A A^- A = A$ 
(choose y as the i-th column of A). Conversely, if $A^-$ 
satisfies $A A^- A = A$ and $Ax = y$ is consistent, then 
$A(A^- y) = A(A^- A x) = Ax = y$ and hence $A^- y$ is a 
solution. Thus, as an alternative definition, a generalized 
inverse matrix of A is any matrix $A^-$ such that $A A^- A = A$. 

¹ Robbert H. Renssen and Gerard H. Martinus, Department of Statistical Methods, Statistics Netherlands, P.O. Box 4481, 6401 CZ Heerlen, The Netherlands. 

Now, if G denotes a generalized inverse of $X' \Lambda X$, then 
the following properties of G are proven in Searle (1971) 
for $\Lambda = I$: 

(P1) $G'$ is also a generalized inverse of $X' \Lambda X$, 

(P2) $X G X' \Lambda X = X$, i.e., $G X' \Lambda$ is a generalized inverse of $X$, 

(P3) $X G X'$ is invariant to the choice of $G$, 

(P4) $X G X' = X G' X'$ whether $G$ is symmetric or not. 

The proofs of (P1) to (P4) for diagonal $\Lambda$ are almost identical 
to those of Searle (1971, chapter 1.5, theorem 7) and therefore 
not repeated here. 
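
Properties (P1) to (P4) can be checked numerically on a small rank-deficient example. The sketch below is an illustration only: the matrices are randomly generated, the Moore-Penrose inverse from NumPy serves as one generalized inverse, and a second, generally non-symmetric one is built as `G1 + V - G1 M V M G1` for an arbitrary matrix `V` (a standard construction, not taken from the paper).

```python
import numpy as np

rng = np.random.default_rng(1)

# Rank-deficient design matrix: the third column is the sum of the first two.
X = rng.normal(size=(8, 2))
X = np.column_stack([X, X[:, 0] + X[:, 1]])
Lam = np.diag(rng.uniform(0.5, 2.0, size=8))   # strictly positive diagonal
M = X.T @ Lam @ X                              # X' Lambda X, singular

# One generalized inverse: the Moore-Penrose inverse.
G1 = np.linalg.pinv(M, rcond=1e-10)
# Another generalized inverse, in general not symmetric.
V = rng.normal(size=(3, 3))
G2 = G1 + V - G1 @ M @ V @ M @ G1

def is_ginv(G):
    """True if M G M = M, i.e. G is a generalized inverse of M."""
    return np.allclose(M @ G @ M, M)

p1 = is_ginv(G2) and is_ginv(G2.T)               # (P1)
p2 = np.allclose(X @ G2 @ X.T @ Lam @ X, X)      # (P2)
p3 = np.allclose(X @ G1 @ X.T, X @ G2 @ X.T)     # (P3)
p4 = np.allclose(X @ G2 @ X.T, X @ G2.T @ X.T)   # (P4)
print(p1, p2, p3, p4)
```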


3. THE GENERAL REGRESSION ESTIMATOR 


Consider a finite population U of N units from which a 
sample S of n units is drawn without replacement. Let $\pi_k$ 
denote the first-order inclusion probability of the k-th unit, 
k = 1, ..., N. We associate with each unit a vector of study 
variables $y_k$. Then, the data matrix for the sampled units is 
given by $Y_s = (y_1, \ldots, y_n)'$. We distinguish between study 
variables with known population totals (auxiliary variables) 
and study variables with unknown population totals. The 
starting point in the definition of a general regression estimator 
(Särndal et al. 1992) is the specification of the weighting 
model, i.e., the choice of the set of auxiliary variables to be 
used in the estimation. Denoting this specific set of p 
variables by x, we call the n x p matrix $X_s = (x_1, \ldots, x_n)'$ 
the design matrix, which is, by definition, a column subset 
of $Y_s$. The vector of known population totals of x is 
denoted by $t_x$. Let $\hat t_{x\pi} = \sum_{k \in S} \pi_k^{-1} x_k$ denote the Horvitz-
Thompson estimator for $t_x$; then, given x, the general 
regression estimator of the population total of the 
i-th study variable $y^{(i)}$ is defined as 


$$\hat t^{(i)}_{yR} = \hat t^{(i)}_{y\pi} + \hat B^{(i)\prime} (t_x - \hat t_{x\pi}) \tag{1}$$

with 

$$\hat B^{(i)} = G_s X_s' \Lambda_s Y_s^{(i)}.$$


In terms of regression weights, this general regression 
estimator can also be written as 

$$\hat t^{(i)}_{yR} = \sum_{k \in S} w_k y^{(i)}_k \tag{2}$$

with 

$$w_k = \pi_k^{-1} + \lambda_k x_k' G_s (t_x - \hat t_{x\pi}).$$

Here, $G_s$ denotes a generalized inverse of $X_s' \Lambda_s X_s$ and 
$\Lambda_s = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ is some diagonal matrix with strictly 
positive entries. 

Like the weighting model, the diagonal matrix $\Lambda_s$ has to 
be specified by the user. Often, one takes $\Lambda_s = \Pi_s^{-1} \Sigma_s^{-1}$, 
where $\Pi_s = \mathrm{diag}(\pi_1, \ldots, \pi_n)$ and $\Sigma_s = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$, with 
$\sigma_k^2$ interpreted as the variance of independent random 
variables of which some of the study variables are supposed 
to be the outcome according to some super-population 
model, see Särndal et al. (1992). It is required that all $\sigma_k^2$ be 
known up to a common scale factor. An important special 
case is $\sigma_k^2 = \sigma^2$, i.e., all the modeled variances are the same. 
This results in the regression estimator proposed by 
Bethlehem and Keller (1987). If the population units 
represent households (of size $m_k$) and if we take $\sigma_k^2 = m_k$, 
we arrive at the estimator proposed by Lemaitre and Dufour 
(1987) to obtain consistent weights between persons and 
households. From a different point of view, Alexander 
(1987) derived the GLS-P estimate, which results in 
essentially the same estimator. 

Below we show that the regression weights are invariant 
to the choice of $G_s$. To that purpose we make the following 
assumption: 

(A1) there exists an n-vector $w$ such that $X_s' w = t_x$. 

Clearly, this assumption states that $X_s' w = t_x$ is a system of 
consistent equations. It is interesting to note that this system 
precisely corresponds to the set of calibration equations 
when considering the general regression estimator as a 
special case of the calibration estimator (see e.g. Deville 
and Särndal 1992). If $X_s' w = t_x$ is a system of consistent 
equations, then so is $X_s' v = t_x - \hat t_{x\pi}$. This is easily seen 
by taking $v = w - d$ with $d = (\pi_1^{-1}, \ldots, \pi_n^{-1})'$, since $X_s' d = \hat t_{x\pi}$. The 
invariance of the regression weights to the choice of $G_s$, 
and hence the invariance of the general regression estimator, 
can be shown as follows. Let $F_s$ be some other generalized 
inverse of $X_s' \Lambda_s X_s$, different from $G_s$. Then, we have 

$$\begin{aligned}
X_s G_s (t_x - \hat t_{x\pi}) &= X_s G_s X_s' v && \text{by (A1)} \\
&= X_s F_s X_s' v && \text{by (P3)} \\
&= X_s F_s (t_x - \hat t_{x\pi}) && \text{by (A1)}.
\end{aligned}$$

So, it holds that $x_k' G_s (t_x - \hat t_{x\pi})$ is invariant to $G_s$ for all $k \in S$, 
implying that the regression weights are invariant to the 
choice of $G_s$. 

The fact that these weights reproduce the population 
totals of the auxiliary variables follows from the following 
series of equations: 


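$$\begin{aligned}
\sum_{k \in S} w_k x_k &= \hat t_{x\pi} + \sum_{k \in S} \lambda_k x_k x_k' G_s (t_x - \hat t_{x\pi}) \\
&= \hat t_{x\pi} + (X_s' \Lambda_s X_s) G_s (t_x - \hat t_{x\pi}) \\
&= \hat t_{x\pi} + (X_s' \Lambda_s X_s) G_s X_s' v && \text{by (A1)} \\
&= \hat t_{x\pi} + X_s' v && \text{by (P2) and (P4)} \\
&= \hat t_{x\pi} + (t_x - \hat t_{x\pi}) = t_x && \text{by (A1)}.
\end{aligned}$$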
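
Both results, the invariance of the weights and the reproduction of the known totals, can be illustrated with a small numerical sketch. It assumes a hypothetical sample of eight units, the incomplete post-stratification A + B with two categories each (so the design matrix has four columns but rank three), equal inclusion probabilities, and $\sigma_k^2 = 1$. The function name `greg_weights` and all numbers are illustrative assumptions.

```python
import numpy as np

def greg_weights(X, pi, lam, t_x, G):
    """Regression weights w_k = 1/pi_k + lam_k * x_k' G (t_x - t_hat)."""
    d = 1.0 / pi                     # design weights
    t_hat = X.T @ d                  # Horvitz-Thompson estimate of t_x
    return d + lam * (X @ G @ (t_x - t_hat))

# Hypothetical sample; weighting model A + B, two categories each.
a = np.array([0, 0, 0, 0, 1, 1, 1, 1])
b = np.array([0, 1, 0, 1, 0, 1, 0, 1])
X = np.column_stack([a == 0, a == 1, b == 0, b == 1]).astype(float)
pi = np.full(8, 0.1)
lam = 1.0 / pi                       # Lambda_s = Pi_s^{-1}, i.e. sigma_k^2 = 1
t_x = np.array([55.0, 45.0, 30.0, 70.0])   # consistent: 55 + 45 = 30 + 70

M = X.T @ (lam[:, None] * X)         # X_s' Lambda_s X_s, singular
G1 = np.linalg.pinv(M, rcond=1e-10)  # Moore-Penrose inverse
V = np.random.default_rng(0).normal(size=M.shape)
G2 = G1 + V - G1 @ M @ V @ M @ G1    # another, non-symmetric g-inverse

w1 = greg_weights(X, pi, lam, t_x, G1)
w2 = greg_weights(X, pi, lam, t_x, G2)
print(np.allclose(w1, w2))           # invariance to the choice of G_s
print(np.allclose(X.T @ w1, t_x))    # calibration on the known totals
```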
We close this section by having a closer look at the 
stated assumption for some well-known weighting models. 
In case of post-stratification in which the weighting model 
is described by a complete crossing of categorical variables, 
(A1) has a simple interpretation. Namely (A1) is satisfied 
if and only if empty post-strata in the sample correspond to 
empty post-strata in the population. Next, we consider 
incomplete post-stratification in which the weighting model 
consists of several terms, each term describing a complete 
crossing of categorical variables and so each term corre- 
sponding to a post-stratification. Then, a necessary condi- 
tion for (A1) to be satisfied is that empty post-strata in the 
sample correspond to empty post-strata in the population for 
each of these terms. Unfortunately, this condition is not 
sufficient. For example, inconsistencies may still occur 
when we attempt to calibrate on a number of complete 
crossings larger than the sample size. 
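
Assumption (A1) can be checked numerically by testing whether the system $X_s' w = t_x$ has an exact solution. The following sketch does this with a least-squares solve and a residual check; the function name `satisfies_a1`, the tolerance, and the example totals are assumptions made for illustration. The example is a complete post-stratification with three post-strata, the third of which is empty in the sample.

```python
import numpy as np

def satisfies_a1(X, t_x, tol=1e-8):
    """Check whether X' w = t_x has a solution w (assumption A1)."""
    w, *_ = np.linalg.lstsq(X.T, t_x, rcond=None)
    residual = np.linalg.norm(X.T @ w - t_x)
    return bool(residual <= tol * max(1.0, np.linalg.norm(t_x)))

# Four sampled units in post-strata 1, 1, 2, 2; post-stratum 3 is not sampled.
X = np.array([[1, 0, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 0]], dtype=float)

empty_in_population = satisfies_a1(X, np.array([60.0, 40.0, 0.0]))
nonempty_in_population = satisfies_a1(X, np.array([60.0, 30.0, 10.0]))
print(empty_in_population, nonempty_in_population)
```

The check succeeds only when the post-stratum that is empty in the sample is also empty in the population, in line with the interpretation given above.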

The assumption is less straightforward in case of 
consistent weighting between persons and households (see 
e.g. Lemaitre and Dufour 1987). This is due to the redefinition 
of the auxiliary variable. For example, if $x_k$ is a 
variable defined at the person level, and from this variable 
a new variable is defined on the household level, say $z_k$, 
then (A1) should be defined in terms of $Z_s = (z_1, \ldots, z_n)'$ 
instead of $X_s$, i.e., (A1) is satisfied if there exists an 
n-vector $w$ such that $Z_s' w = t_x$. In many (regular) situations, 
the linear manifold spanned by $Z_s$ will coincide with 
the linear manifold spanned by $X_s$. In such situations the 
method of Lemaitre and Dufour does not affect the validity 
of (A1). However, in specific cases this may not be true. 
The following simplified example illustrates this. 

Let $x_k$ denote sex of the k-th person, say $x_k = (0, 1)'$ if 
the k-th person is a female and $x_k = (1, 0)'$ if the k-th person is a 
male. According to the method of Lemaitre and Dufour 
(1987), let $z_k$ denote the j-th household mean for $x_k$ whenever 
k belongs to the j-th household. Furthermore, let the 
population consist of $N_1$ males and $N_2$ females, from 
which a sample of 10 households is drawn. Suppose that 
each sampled household consists of two persons, namely 
one male and one female. This gives $z_k = (1/2, 1/2)'$ for all $k \in S$. 
For this example the linear manifold spanned by $Z_s$ is a 
linear subspace of the linear manifold spanned by $X_s$. If 
$N_1 = N_2$ then (A1) is satisfied. Otherwise, if $N_1 \neq N_2$ then 
(A1) is not satisfied. Especially, when the method of 
Lemaitre and Dufour is applied on a relatively large 
weighting model, the linear manifold spanned by $Z_s$ may 
be a proper subspace of the linear manifold spanned by $X_s$. 
Then, (A1) only is satisfied if $t_x$ accidentally belongs to 
this subspace. 
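
The household example can be reproduced in a few lines. The sketch below assumes hypothetical population counts and tests consistency of $Z_s' w = t_x$ by comparing the rank of $Z_s'$ with the rank of the same matrix extended by $t_x$; the function name `consistent` is an illustrative assumption.

```python
import numpy as np

n_households = 10
# Person-level sex indicators: (1, 0) for a male, (0, 1) for a female.
X = np.tile(np.array([[1.0, 0.0], [0.0, 1.0]]), (n_households, 1))
# Household means, assigned to both persons of each household.
Z = np.full_like(X, 0.5)

def consistent(D, t):
    """D' w = t is solvable iff appending t does not raise the rank of D'."""
    A = D.T
    return bool(np.linalg.matrix_rank(A)
                == np.linalg.matrix_rank(np.column_stack([A, t])))

equal = np.array([5000.0, 5000.0])     # hypothetical totals with N_1 = N_2
unequal = np.array([4900.0, 5100.0])   # hypothetical totals with N_1 != N_2

x_unequal = consistent(X, unequal)     # person-level x: (A1) holds
z_equal = consistent(Z, equal)         # household means, equal totals
z_unequal = consistent(Z, unequal)     # household means, unequal totals
print(x_unequal, z_equal, z_unequal)
```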


4. CALCULATING THE REGRESSION 
WEIGHTS IN BASCULA 


In the previous section we have shown that the general 
regression weights $w_k = \pi_k^{-1} + \lambda_k x_k' G_s (t_x - \hat t_{x\pi})$ are 
invariant to the choice of $G_s$. In this section we show how 
to compute these weights. To do so, we start with the 
Cholesky decomposition of the positive (semi) definite 
matrix $X_s' \Lambda_s X_s$, see Seber (1977, page 322). If $X_s$ is of 
full rank, then $X_s' \Lambda_s X_s$ is positive definite and it can be 
expressed uniquely in the form $X_s' \Lambda_s X_s = U'U$, where U 
is an upper triangular matrix with positive diagonal 
elements. Let $a_{ij}$ denote the ij-th element of $X_s' \Lambda_s X_s$, then 
U can be computed, row by row, according to 


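$$u_{ii} = \Big(a_{ii} - \sum_{k=1}^{i-1} u_{ki}^2\Big)^{1/2}, \qquad i = 1, \ldots, p,$$

and 

$$u_{ij} = \frac{a_{ij} - \sum_{k=1}^{i-1} u_{ki} u_{kj}}{u_{ii}}, \qquad i = 1, \ldots, p; \; j = i+1, \ldots, p. \tag{3}$$
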
If $X_s$ has rank r < p, then an application of (3) will give r 
non-zero and p - r zero diagonal elements of U. If we find 
a zero diagonal element then we put its corresponding row 
and column elements at zero. Subsequently, by elementary 
row and column interchanges, we obtain the following 
upper triangular matrix: 

$$\begin{pmatrix} U_1 & 0 \\ 0 & 0 \end{pmatrix},$$

where $U_1$ is an upper triangular matrix of order r x r with non-zero diagonal elements. 

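In accordance with the elementary row and column interchanges 
we also interchange the elements of $X_s$ and 
$(t_x - \hat t_{x\pi})$: $X_s E' = (X_{1s}, X_{2s})$ and 

$$E (t_x - \hat t_{x\pi}) = \begin{pmatrix} t_{1x} - \hat t_{1x\pi} \\ t_{2x} - \hat t_{2x\pi} \end{pmatrix},$$

where, by construction, $X_{1s}$ is of full rank and E is a 
non-singular matrix of order p x p. But, since 

$$G_0 = \begin{pmatrix} U_1^{-1} (U_1')^{-1} & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} (X_{1s}' \Lambda_s X_{1s})^{-1} & 0 \\ 0 & 0 \end{pmatrix}$$

is a generalized inverse of $(X_{1s}, X_{2s})' \Lambda_s (X_{1s}, X_{2s})$, we 
have that $G_s = E' G_0 E$ is a generalized inverse of 
$X_s' \Lambda_s X_s$. Substituting this generalized inverse into 
$w_k = \pi_k^{-1} + \lambda_k x_k' G_s (t_x - \hat t_{x\pi})$ gives 

$$\begin{aligned}
w_k &= \pi_k^{-1} + \lambda_k (x_{1k}', x_{2k}') \, G_0 \begin{pmatrix} t_{1x} - \hat t_{1x\pi} \\ t_{2x} - \hat t_{2x\pi} \end{pmatrix} \\
&= \pi_k^{-1} + \lambda_k x_{1k}' U_1^{-1} (U_1')^{-1} (t_{1x} - \hat t_{1x\pi}),
\end{aligned}$$

which is computed as follows. First $z = (U_1')^{-1} (t_{1x} - \hat t_{1x\pi})$ 
is computed by solving the lower triangular system 
$U_1' z = (t_{1x} - \hat t_{1x\pi})$. Thereafter $u = U_1^{-1} z$ is computed by 
solving the upper triangular system $U_1 u = z$. Once 
$u = U_1^{-1} (U_1')^{-1} (t_{1x} - \hat t_{1x\pi})$ is computed it is a simple matter 
to compute $w_k$. 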
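
The steps of this section can be sketched in code. The following is a minimal illustration under stated assumptions, not the Bascula implementation: the function names, the relative tolerance used to decide that a pivot is zero, and the example data are all hypothetical, and a general-purpose solver is used where a dedicated triangular solver would normally be preferred.

```python
import numpy as np

def cholesky_semidefinite(A, tol=1e-10):
    """Row-by-row Cholesky A = U'U for a positive semi-definite A.

    A (near-)zero pivot marks a redundant column; its row and column
    in U are set to zero. Returns U and the retained column indices.
    """
    p = A.shape[0]
    U = np.zeros((p, p))
    keep = []
    for i in range(p):
        pivot = A[i, i] - U[:i, i] @ U[:i, i]
        if pivot <= tol * max(A[i, i], 1.0):
            U[:, i] = 0.0                      # redundant column
            continue
        U[i, i] = np.sqrt(pivot)
        U[i, i + 1:] = (A[i, i + 1:] - U[:i, i] @ U[:i, i + 1:]) / U[i, i]
        keep.append(i)
    return U, keep

def regression_weights(X, pi, lam, t_x):
    """Weights w_k = 1/pi_k + lam_k * x_1k' U1^{-1} (U1')^{-1} (t_1x - t_hat_1)."""
    d = 1.0 / pi
    A = X.T @ (lam[:, None] * X)
    U, keep = cholesky_semidefinite(A)
    U1 = U[np.ix_(keep, keep)]                 # r x r upper triangular block
    rhs = (t_x - X.T @ d)[keep]
    z = np.linalg.solve(U1.T, rhs)             # lower triangular system U1' z = rhs
    u = np.linalg.solve(U1, z)                 # upper triangular system U1 u = z
    return d + lam * (X[:, keep] @ u), keep

# Hypothetical sample; weighting model A + B, two categories each.
a = np.array([0, 0, 0, 0, 1, 1, 1, 1])
b = np.array([0, 1, 0, 1, 0, 1, 0, 1])
X = np.column_stack([a == 0, a == 1, b == 0, b == 1]).astype(float)
pi = np.full(8, 0.1)
lam = 1.0 / pi
t_x = np.array([55.0, 45.0, 30.0, 70.0])       # satisfies (A1)

w, keep = regression_weights(X, pi, lam, t_x)
print(keep)
print(np.allclose(X.T @ w, t_x))               # all four totals are reproduced
```

Although only the retained columns enter the triangular solves, the weights also reproduce the total of the dropped column, because (A1) guarantees that the same linear dependency holds among the population totals.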


5. THE DUTCH LABOUR FORCE SURVEY 


To illustrate some of the issues stated in this paper, we 
briefly discuss the weighting model of the Dutch Labour 
Force Survey (LFS) of 1987 up to 2000. The target popu- 
lation of this survey consisted of the non-institutional popu- 
lation residing in the Netherlands and its sampling design 
was based on a stratified three-stage sampling with 
households as ultimate sampling units. For details we refer 
to Nieuwenbroek and Van der Valk (1996). Five categorical 
variables were involved in the weighting model, namely 
Sex (2 categories), Age (12 categories), Marital Status (2 
categories), Region (15 categories), and Nationality (2 
categories). Mainly based on consistency requirements, the 
desired weighting model was 


Sex x Age x MaritalStatus x Region x Nationality. 


However, this weighting model resulted in too many small 
cell counts, which gave unstable estimators. Therefore, the 
reduced model 


(Sex x Age x MaritalStatus x Region) 
+ (Sex x Age* x Region x Nationality) 


was used instead, where Age* (2 categories) was obtained 
by grouping the categories in Age. This reduced weighting 
model resulted in a design matrix not of full rank for two 
reasons, namely 1) some columns of the design matrix 
completely consisted of zeros due to impossible combina- 
tions of the categorical variables and 2) there were linear 
combinations between the columns of the design matrix. 

Now, the first kind of redundancy can be easily traced. 
If such columns are found, then their corresponding popu- 
lation totals should be zero. Bascula carries out a check on 
this condition. The second kind of redundancy is more 
difficult to trace. Linear combinations between columns 
may arise because one variable is incorporated into several 
weighting terms. For example, sex and region appear in 
both weighting terms of the LFS weighting model. The 
resulting linear combinations can be recognized beforehand 
by the name of the variable. For the age-variable, which 
also appears in both weighting terms, such a redundancy 
check beforehand is less obvious. These latter kinds of 
redundancy are traced by means of the Cholesky decompo- 
sition. Naturally, if any linear combinations are found, 
either by name beforehand or by the Cholesky decomposition, 
then the same linear combinations should also exist 
among the elements of the vector of population totals. Bascula also checks 
this condition. 
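
As a small illustration of the sizes involved and of the first kind of check, the sketch below counts the cells implied by the two weighting models from the category numbers given above and flags all-zero columns whose population total is not zero. The function names and the toy design matrix are assumptions; this is not Bascula code.

```python
import numpy as np

# Number of categories per variable, as listed in the text.
levels = {"Sex": 2, "Age": 12, "MaritalStatus": 2, "Region": 15,
          "Nationality": 2, "Age*": 2}

def n_cells(term):
    """Number of cells in a complete crossing of the given variables."""
    return int(np.prod([levels[v] for v in term]))

full = n_cells(["Sex", "Age", "MaritalStatus", "Region", "Nationality"])
reduced = (n_cells(["Sex", "Age", "MaritalStatus", "Region"])
           + n_cells(["Sex", "Age*", "Region", "Nationality"]))

def zero_column_conflicts(X, t_x):
    """Indices of all-zero columns of X whose population total is non-zero."""
    empty = ~X.any(axis=0)
    return np.flatnonzero(empty & (t_x != 0))

# Toy design matrix whose second column is entirely zero.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
ok = zero_column_conflicts(X, np.array([10.0, 0.0, 5.0]))
bad = zero_column_conflicts(X, np.array([10.0, 3.0, 5.0]))
print(full, reduced, ok.tolist(), bad.tolist())
```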